Abstract. This work studies external regret in sequential prediction games with
both positive and negative payoffs. External regret measures the difference between
the payoff obtained by the forecasting strategy and the payoff of the best action.
In this setting, we derive new and sharper regret bounds for the well-known
exponentially weighted average forecaster and for a new forecaster with a different
multiplicative update rule. Our analysis has two main advantages: first, no
preliminary knowledge about the payoff sequence is needed, not even its range; second,
our bounds are expressed in terms of sums of squared payoffs, replacing larger
first-order quantities appearing in previous bounds. In addition, our most refined
bounds have the natural and desirable property of being stable under rescalings and
general translations of the payoff sequence.



1. Introduction 

The study of online forecasting strategies in adversarial settings has
received considerable attention in the last few years. One of the goals
of the research in this area is the design of randomized online algorithms
that achieve a low external regret; i.e., algorithms able to minimize the
difference between their expected cumulative payoff and the cumulative
payoff achievable using the single best action (or, equivalently, the
single best strategy in a given class).

If the payoffs are uniformly bounded, and there are finitely many
actions, then there exist simple forecasting strategies whose external
regret per time step vanishes irrespective of the choice of the payoff
sequence. In particular, under the assumption that all payoffs have
the same sign (say positive), the best achieved rates for the regret
are of the order of $\sqrt{X^*}/n$, where $X^*/n$ is the highest average
payoff among all actions after $n$ time steps. If the payoffs were
generated by an independent stochastic process, however, the tightest
rate for the regret with respect to a fixed action should depend on the
variance (rather than the average) of the observed payoffs for that
action. Proving such a rate in a fully adversarial setting would be a
fundamental result, and in this paper we propose new forecasting
strategies that make a significant step towards this goal.

* An extended abstract appeared in the Proceedings of the 18th Annual
Conference on Learning Theory, Springer, 2005. The work of all authors was
supported in part by the IST Programme of the European Community, under the
PASCAL Network of Excellence, IST-2002-506778.

† The work was done while the author was a fellow in the Institute of Advanced
Studies, Hebrew University. His work was also supported by grant no. 1079/04
from the Israel Science Foundation and an IBM faculty award.

© 2008 Kluwer Academic Publishers. Printed in the Netherlands.

varbounds6.tex; 8/02/2008; 18:15; p.1

Generally speaking, one would normally expect any performance
bound to be maintained under scaling and translation, since the units
of measurement should not make a difference (for example, predicting
the temperature should give similar performance irrespective of the
scale, Celsius, Fahrenheit or Kelvin, on which the temperature is
measured). However, in many computational settings this does not hold;
for example, in many domains there is a considerable difference between
approximating a reward problem and approximating its dual cost problem
(although they have an identical optimal solution). Most of our bounds
also assume no knowledge of the sequence of the ranges of the payoffs.
For this reason it is important for us to stress that our bounds are
stable under rescalings of the payoff sequence, even in the most general
case of payoffs with arbitrary signs. The issues of invariance by
translations and rescalings, discussed in more depth in Section 5.3,
show that, in some sense, the bounds introduced in this paper are more
"fundamental" than previous results. In order to describe our results we
first set up our model and notation, and then we review previous related
work.

In this paper we consider the following decision-theoretic variant,
proposed by Freund and Schapire (1997), of the framework of prediction
with expert advice introduced by Littlestone and Warmuth (1994)
and Vovk (1998). A forecaster repeatedly assigns probabilities to a fixed
set of actions. After each assignment, the actual payoff associated with
each action is revealed and new payoffs are set for the next round. The
forecaster's reward on each round is the average payoff of actions for
that round, where the average is computed according to the forecaster's
current probability assignment. The goal of the forecaster is to achieve,
on any sequence of payoffs, a cumulative reward close to $X^*$, the
highest cumulative payoff among all actions. We call regret the difference
between $X^*$ and the cumulative reward achieved by the forecaster on
the same payoff sequence.

In Section 2 we review the previously known bounds on the regret.
The most basic one, obtained via the exponentially weighted average
forecaster of Littlestone and Warmuth (1994) and Vovk (1998), bounds
the regret by a quantity of the order of $M\sqrt{n \ln N}$, where $N$ is
the number of actions and $M$ is a known upper bound on the magnitude
of payoffs.

In the special case of "one-sided games", when all payoffs have the
same sign (they are either always nonpositive or always nonnegative),
Freund and Schapire (1997) showed that Littlestone and Warmuth's
weighted majority algorithm (1994) can be used to obtain a regret of
the order of $\sqrt{M |X^*| \ln N} + M \ln N$. (If all payoffs are
nonpositive, then the absolute value of each payoff is called loss and
$|X^*|$ is the cumulative loss of the best action.) By a simple rescaling
and translation of payoffs, it is possible to reduce the more general
"signed game", in which each payoff might have an arbitrary sign, to
either one of the one-sided games, and thus, bounds can be derived using
this reduction. However, the transformation also maps $|X^*|$ to either
$Mn + X^*$ or $Mn - X^*$, thus significantly weakening the attractiveness
of such a bound.

Recently, Allenberg-Neeman and Neeman (2004) proposed a direct
analysis of the signed game avoiding this reduction. They proved that
weighted majority (used in conjunction with a doubling trick) achieves
the following: on any sequence of payoffs there exists an action $j$ such
that the regret is at most of order
$\sqrt{M (\ln N) \sum_{t=1}^n |x_{j,t}|}$, where $x_{j,t}$
is the payoff obtained by action $j$ at round $t$, and
$M = \max_{i,t} |x_{i,t}|$ is a known upper bound on the magnitude of
payoffs. Note that this bound does not relate the regret to the sum
$A^*_n = |x_{j^*,1}| + \cdots + |x_{j^*,n}|$ of payoff magnitudes for the
optimal action $j^*$ (i.e., the one achieving $X^*$). In particular, the
bound of order $\sqrt{M A^*_n \ln N} + M \ln N$ for one-sided
games is only obtained if an estimate of $A^*_n$ is available in advance.

In this paper we show new regret bounds for signed games. Our
analysis has two main advantages: first, no preliminary knowledge about
the payoff magnitude $M$ or about the best cumulative payoff $X^*$ is
needed; second, our bounds are expressed in terms of sums of squared
payoffs, such as $x_{j,1}^2 + \cdots + x_{j,n}^2$ and related forms. These
quantities replace the larger terms $M(|x_{j,1}| + \cdots + |x_{j,n}|)$
appearing in the previous bounds. As an application of our results we
obtain, without any preliminary knowledge on the payoff sequence, an
improved regret bound for one-sided games of the order of
$\sqrt{(Mn - |X^*|)(|X^*|/n)(\ln N)}$.

Some of our bounds are achieved using forecasters based on weighted 
majority run with a dynamic learning rate. However, we are able to 
obtain second-order bounds of a different flavor using a new forecaster 
that does not use the exponential probability assignments of weighted 
majority. In particular, unlike virtually all previously known forecasting 
schemes, the weights of this forecaster cannot be represented as the 
gradient of an additive potential (see the monograph by Cesa-Bianchi
and Lugosi, 2006 for an introduction to potential-based forecasters). 

2. An overview of our results 

We classify the existing regret bounds as zero-, first-, and second-order
bounds. A zero-order regret bound depends only on the number of
time steps and on upper bounds on the individual payoffs. A first-order
bound has a main term that depends on a sum of payoffs, while
the main term of a second-order bound depends on a sum of squares of
the payoffs. In this section we will also briefly discuss the information
which the algorithms require in order to achieve the bounds.

We first introduce some notation and terminology. Our forecasting
game is played in rounds. At each time step $t = 1, 2, \ldots$ the forecaster
computes an assignment $p_t = (p_{1,t}, \ldots, p_{N,t})$ of probabilities
over the $N$ actions. Then the payoff vector
$x_t = (x_{1,t}, \ldots, x_{N,t}) \in \mathbb{R}^N$ for time $t$
is revealed and the forecaster's reward is
$\widehat{x}_t = x_{1,t} p_{1,t} + \cdots + x_{N,t} p_{N,t}$.
We define the cumulative reward of the forecaster by
$\widehat{X}_n = \widehat{x}_1 + \cdots + \widehat{x}_n$
and the cumulative payoff of action $i$ by
$X_{i,n} = x_{i,1} + \cdots + x_{i,n}$. For
all $n$, let $X^*_n = \max_{i=1,\ldots,N} X_{i,n}$ be the cumulative payoff
of the best action up to time $n$. The forecaster's goal is to keep the
regret $X^*_n - \widehat{X}_n$ as small as possible uniformly over $n$.
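The protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `play` and `uniform`, the callback signature, and the toy payoff table are hypothetical and do not come from the paper.

```python
def play(payoffs, forecaster):
    """Run the game on payoffs[t][i] and return (reward, best, regret)."""
    n_actions = len(payoffs[0])
    reward = 0.0
    for t, x_t in enumerate(payoffs):
        # p_t may depend only on the payoff vectors revealed before round t.
        p_t = forecaster(payoffs[:t], n_actions)
        # Reward of the round: average payoff under the assignment p_t.
        reward += sum(p * x for p, x in zip(p_t, x_t))
    # Cumulative payoff of each action, and the best of them.
    totals = [sum(row[i] for row in payoffs) for i in range(n_actions)]
    best = max(totals)
    return reward, best, best - reward


def uniform(past, n_actions):
    """Hypothetical baseline: ignore the past, weight all actions equally."""
    return [1.0 / n_actions] * n_actions


# Toy signed game with n = 3 rounds and N = 2 actions.
payoffs = [[1.0, -1.0], [0.5, 0.0], [-2.0, 1.0]]
reward, best, regret = play(payoffs, uniform)
```

Any forecaster discussed in the following sections can be plugged in as the callback, provided it only reads the past payoff rows.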

The one-sided games mentioned in the introduction are the loss
game, where $x_{i,t} \le 0$ for all $i$ and $t$, and the gain game, where
$x_{i,t} \ge 0$ for all $i$ and $t$. We call signed game the setup in which
no assumptions are made on the sign of the payoffs.

2.1. Zero-order bounds 

We say that a bound is of order zero whenever it only depends on
bounds on the payoffs (or on the payoff ranges) and on the number of
time steps $n$. The basic version of the exponentially weighted average
forecaster of Littlestone and Warmuth (1994) ensures that the order
of magnitude of the regret is $M\sqrt{n \ln N}$ where $M$ is a bound on the
payoffs: $|x_{i,t}| \le M$ for all $t \ge 1$ and $i = 1, \ldots, N$.
(Actually, the factor $M$ may be replaced by a bound $E$ on the effective
ranges of the payoffs, defined by $|x_{i,t} - x_{j,t}| \le E$ for all
$t \ge 1$ and $i, j = 1, \ldots, N$.) This basic version of the regret bound
assumes that we have prior knowledge of both $n$ and $M$ (or $E$).

In the case when $n$ is not known in advance one can use a doubling
trick (that is, restart the algorithm at times $n = 2^k$ for $k \ge \ln N$)
and achieve a regret bound of the same order, $M\sqrt{n \ln N}$ (only the
constant factor increases). Similarly, if $M$ is not known in advance,
one can restart the algorithm every time the maximum observed payoff
exceeds the current estimate, and take the double of the old estimate
as the new current estimate. Again, this influences the regret bound by
only a constant factor. (The initial value of the estimate of $M$ can be
set to the maximal value in the first time step, see the techniques used
in Section 3.)

A more elegant alternative, rather than restarting the algorithm
from scratch, is proposed by Auer, Cesa-Bianchi, and Gentile (2002),
who consider a time-varying tuning parameter
$\eta_t \sim (1/M) \sqrt{(\ln N)/t}$.
They also derive a regret bound of the order of $M\sqrt{n \ln N}$ uniformly
over the number $n$ of steps. Their method can be adapted along the
lines of the techniques of Section 4.2 to deal with the case when $M$ (or
$E$) is also unknown.

The results for the forecaster of Section 4 imply a zero-order bound
sharper than $E\sqrt{n \ln N}$. This is presented in Corollary 1 and
basically replaces $E\sqrt{n}$ by $\sqrt{E_1^2 + \cdots + E_n^2}$, where
$E_t$ is the effective range of the payoffs at round $t$,

$$ E_t = \max_{i=1,\ldots,N} x_{i,t} - \min_{j=1,\ldots,N} x_{j,t} . \qquad (1) $$

2.2. One-sided games: first-order regret bounds

We say that a regret bound is first-order whenever its main term
depends on a sum of payoffs. Since the payoff of any action is at most
$Mn$, these bounds are usually sharper than zero-order bounds. More
specifically, they have the potential of a huge improvement (when,
for instance, the payoff of the best action is much smaller than $Mn$)
while they are at most worse by a constant factor with respect to their
zero-order counterparts.

When all payoffs have the same sign, Freund and Schapire (1997)
first showed that Littlestone and Warmuth's weighted majority
algorithm (1994) can be used as a basic ingredient to construct a
forecasting strategy achieving a regret of order
$\sqrt{M |X^*_n| \ln N} + M \ln N$, where $|X^*_n|$
is the absolute value of the cumulative payoff of the best action (i.e.,
the largest cumulative payoff in a gain game or the smallest cumulative
loss in a loss game).

In order to achieve the above regret bound, the weighted majority
algorithm needs prior knowledge of $|X^*_n|$ (or a bound on it) and of the
payoff magnitude $M$. As usual one can overcome this by a doubling
trick. Doubling in this case is slightly more delicate, and would
result in a bound of the order of
$\sqrt{M |X^*_n| \ln N} + M \ln(Mn) \ln N$. Here
again, the techniques of Auer, Cesa-Bianchi, and Gentile (2002) could
be adapted along the lines of the techniques of Section 4 to get a
forecaster that, without restarting and without previous knowledge of
$M$ and $X^*_n$, achieves a regret bounded by a quantity of the order of
$\sqrt{M |X^*_n| \ln N} + M \ln N$.

2.3. Signed games: first-order regret bounds 

As mentioned in the introduction, one can translate a signed game
to a one-sided game as follows. Consider a signed game with payoffs
$x_{i,t} \in [-M, M]$. Provided that $M$ is known to the forecaster, he may
use the translation $x'_{i,t} = x_{i,t} + M$ to convert the signed game into
a gain game. For the resulting gain game, by using the techniques described
above, one can derive a regret bound of the order of

$$ \sqrt{M (\ln N) (Mn + X^*_n)} + M \ln N . \qquad (2) $$

Similarly, using the translation $x'_{i,t} = x_{i,t} - M$, we get a loss
game, for which one can derive the similar regret bound

$$ \sqrt{M (\ln N) (Mn - X^*_n)} + M \ln N . \qquad (3) $$

The main weakness of the transformation is that the bounds (2) and (3)
are essentially zero-order bounds, though this depends on the precise
value of $X^*_n$. (Note that when $M$ is unknown, or to get tighter bounds,
one may use the translation
$x'_{i,t} = x_{i,t} - \min_{j=1,\ldots,N} x_{j,t}$ from signed
games to gain games, or the translation
$x'_{i,t} = x_{i,t} - \max_{j=1,\ldots,N} x_{j,t}$
from signed games to loss games.)

Recently, Allenberg-Neeman and Neeman (2004) proposed a direct
analysis of the signed game avoiding this reduction. They give a simple
algorithm whose regret is of the order of
$\sqrt{M A^*_n \ln N} + M \ln N$, where
$A^*_n = |x_{k^*_n,1}| + \cdots + |x_{k^*_n,n}|$ is the sum of the absolute
values of the payoffs of the best expert $k^*_n$ for the rounds
$1, \ldots, n$. Since $A^*_n = |X^*_n|$
in case of a one-sided game, this is indeed a generalization to signed
games of Freund and Schapire's first-order bound for one-sided games.
Though Allenberg-Neeman and Neeman need prior knowledge of both
$M$ and $A^*_n$ to tune the parameters of the algorithm, a direct extension
of their results along the lines of Section 3.1 gives the first-order bound

$$ \sqrt{M (\ln N) \max_{t=1,\ldots,n} A^*_t} + M \ln N
   = \sqrt{M (\ln N) \max_{t=1,\ldots,n} \sum_{s=1}^{t} |x_{k^*_t,s}|} + M \ln N \qquad (4) $$

which holds when only $M$ is known.

2.4. Second-order bounds on the regret 

A regret bound is second-order whenever its main term is a function
of a sum of squared payoffs (or of a quantity that is homogeneous in
such a sum). Ideally, they are a function of

$$ Q^*_n = \sum_{t=1}^{n} x_{k^*_n,t}^2 . $$

Expressions involving squared payoffs are at the core of many
analyses in the framework of prediction with expert advice, especially in
the presence of limited feedback. (See, for instance, the bandit problem,
studied by Auer et al., 2002, and more generally prediction under
partial monitoring and the work of Cesa-Bianchi, Lugosi, and Stoltz, 2005,
Cesa-Bianchi, Lugosi, and Stoltz, 2004, Piccolboni and Schindelhauer,
2001.) However, to the best of our knowledge, the bounds presented
here are the first ones to explicitly include second-order information
extracted from the payoff sequence.

In Section 3 we give a very simple algorithm whose regret is of the
order of $\sqrt{Q^*_n \ln N} + M \ln N$. Since $Q^*_n \le M A^*_n$, this
bound improves on the first-order bounds. Even though our basic algorithm
needs prior knowledge of both $M$ and $Q^*_n$ to tune its parameters, we
are able to extend it (essentially by using various doubling tricks) and
achieve a bound of the order of

$$ \sqrt{(\ln N) \max_{t=1,\ldots,n} Q^*_t} + M \ln N
   = \sqrt{(\ln N) \max_{t=1,\ldots,n} \sum_{s=1}^{t} x_{k^*_t,s}^2} + M \ln N \qquad (5) $$

without using any prior knowledge about $Q^*_n$. (The extension is not as
straightforward as one would expect, since the quantities $Q^*_t$ are not
necessarily monotone over time.)

Note that this bound is less sensitive to extreme values. For instance,
in case of a loss game (i.e., all payoffs are nonpositive),
$Q^*_t \le M L^*_t$, where $L^*_t$ is the cumulative loss of the best action
up to time $t$. Therefore, $\max_{s \le n} Q^*_s \le M L^*_n$ and the bound
(5) is at least as good as the family of bounds called "improvements for
small losses" (or first-order bounds) presented in Section 2.2. However,
it is easy to exhibit examples where the new bound is far better by
considering sequences of outcomes where there are some "outliers" among
the $x_{i,t}$. These outliers may raise the maximum $M$ significantly,
whereas they have only little impact on $\max_{s \le n} Q^*_s$.
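The effect of outliers can be checked numerically. The sketch below is a hypothetical example, not taken from the paper: it builds a loss game with a single large loss on a suboptimal action and compares $Q^*_n$ with $M L^*_n$ at the final time $n$ only (the bound (5) uses a maximum over time, which this toy check does not compute). Ties for the best action are not handled by the minimal-$Q$ rule here.

```python
def best_action_stats(payoffs):
    """Return (k, Q*_n, M * |X*_n|) for a payoff table payoffs[t][i]."""
    n_actions = len(payoffs[0])
    totals = [sum(row[i] for row in payoffs) for i in range(n_actions)]
    # Index of the action with the largest cumulative payoff at time n.
    k = max(range(n_actions), key=lambda i: totals[i])
    # Sum of squared payoffs of that action.
    q_star = sum(row[k] ** 2 for row in payoffs)
    # Largest payoff magnitude over all actions and rounds.
    m = max(abs(x) for row in payoffs for x in row)
    return k, q_star, m * abs(totals[k])


# 100 rounds of small losses; action 1 suffers one outlier of size 10.
payoffs = [[-0.1, -0.1] for _ in range(99)] + [[-0.1, -10.0]]
k, q_star, m_times_loss = best_action_stats(payoffs)
```

In this toy sequence the outlier multiplies $M$ by 100 and hence the first-order quantity $M L^*_n$ as well, while $Q^*_n$ is unaffected because the outlier does not belong to the best action.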
We also analyze the weighted majority algorithm in Section 4, and
show how exponential weights with a time-varying parameter can be
used to derive a regret bound of the order of
$\sqrt{V_n \ln N} + E \ln N$, where
$V_n$ is the cumulative variance of the forecaster's rewards on the given
sequence and $E$ is the range of the payoffs. (Again, we derive first the
bound in the case where the payoff range is known, and then extend
it to the case where the payoff range is unknown.) The above bound
is somewhat different from standard regret bounds because it depends
on the predictions of the forecaster. In Sections 4.4 and 5 we show how
one can use such a bound to derive regret bounds which only depend
on the sequence of payoffs.



3. A new algorithm for sequential prediction 



We introduce a new forecasting strategy for the signed game. In
Theorem 4, the main result of this section, we show that, without any
preliminary knowledge of the sequence of payoffs, the regret of a variant
of this strategy is bounded by a quantity defined in terms of the sums
$Q_{i,n} = x_{i,1}^2 + \cdots + x_{i,n}^2$. Since
$Q_{i,n} \le M(|x_{i,1}| + \cdots + |x_{i,n}|)$, such second-order
bounds are generally better than all previously known bounds
(see Section 2).

Our basic forecasting strategy, which we call prod($\eta$), has an input
parameter $\eta > 0$ and maintains a set of $N$ weights. At time $t = 1$
the weights are initialized with $w_{i,1} = 1$ for $i = 1, \ldots, N$. At
each time $t = 1, 2, \ldots$, prod($\eta$) computes the probability
assignment $p_t = (p_{1,t}, \ldots, p_{N,t})$, where
$p_{i,t} = w_{i,t}/W_t$ and $W_t = w_{1,t} + \cdots + w_{N,t}$. After
the payoff vector $x_t$ is revealed, the weights are updated using the rule
$w_{i,t+1} = w_{i,t}(1 + \eta x_{i,t})$. The following simple fact plays a
key role in our analysis.
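A minimal Python sketch of prod($\eta$) follows, assuming payoffs are given as a list of per-round lists. The function name `prod_forecaster` and the explicit positivity check are illustrative choices, not part of the paper; the check reflects the fact that the multiplicative update only keeps the weights positive when $1 + \eta x_{i,t} > 0$.

```python
def prod_forecaster(payoffs, eta):
    """Run prod(eta); return (cumulative reward, list of assignments p_t)."""
    n_actions = len(payoffs[0])
    w = [1.0] * n_actions                      # w_{i,1} = 1
    reward = 0.0
    history = []
    for x_t in payoffs:
        total = sum(w)                         # W_t
        p_t = [w_i / total for w_i in w]       # p_{i,t} = w_{i,t} / W_t
        history.append(p_t)
        reward += sum(p * x for p, x in zip(p_t, x_t))
        for x in x_t:
            if 1.0 + eta * x <= 0.0:
                raise ValueError("eta too large: a weight would not stay positive")
        # Multiplicative update w_{i,t+1} = w_{i,t} (1 + eta x_{i,t}).
        w = [w_i * (1.0 + eta * x) for w_i, x in zip(w, x_t)]
    return reward, history


reward, history = prod_forecaster([[1.0, 0.0], [0.0, 1.0]], eta=0.5)
```

Unlike exponential weights, the update is linear in the payoff, which is why the analysis relies on a lower bound for $\ln(1 + z)$ rather than on an exponential potential.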

Lemma 1. For all $z \ge -1/2$, $\ln(1 + z) \ge z - z^2$.

Proof. Let $f(z) = \ln(1 + z) - z + z^2$. Note that

$$ f'(z) = \frac{1}{1+z} - 1 + 2z = \frac{z(1 + 2z)}{1 + z} $$

so that $f'(z) \le 0$ for $-1/2 \le z \le 0$ and $f'(z) \ge 0$ for
$z \ge 0$. Hence the minimum of $f$ is achieved at $0$ and equals $0$,
concluding the proof. □



We are now ready to state a lower bound on the cumulative reward of
prod($\eta$) in terms of the quantities $Q_{k,n}$.

Lemma 2. Assume there exists $M > 0$ such that the payoffs satisfy
$x_{i,t} \ge -M$ for $t = 1, \ldots, n$ and $i = 1, \ldots, N$. For any
sequence of payoffs, for any action $k$, for any $\eta \le 1/(2M)$, and for
any $n \ge 1$, the cumulative reward of prod($\eta$) is lower bounded as

$$ \widehat{X}_n \ge X_{k,n} - \frac{\ln N}{\eta} - \eta\, Q_{k,n} . $$

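Proof. For any $k = 1, \ldots, N$, note that $x_{k,t} \ge -M$ and
$\eta \le 1/(2M)$ imply $\eta x_{k,t} \ge -1/2$. Hence, we can apply
Lemma 1 to $\eta x_{k,t}$ and get

$$ \ln \frac{W_{n+1}}{W_1} \ge \ln \frac{w_{k,n+1}}{W_1}
   = -\ln N + \ln \prod_{t=1}^{n} (1 + \eta x_{k,t})
   = -\ln N + \sum_{t=1}^{n} \ln(1 + \eta x_{k,t}) $$
$$ \ge -\ln N + \sum_{t=1}^{n} \left( \eta x_{k,t} - \eta^2 x_{k,t}^2 \right)
   = -\ln N + \eta X_{k,n} - \eta^2 Q_{k,n} . \qquad (6) $$

On the other hand,

$$ \ln \frac{W_{n+1}}{W_1} = \sum_{t=1}^{n} \ln \frac{W_{t+1}}{W_t}
   = \sum_{t=1}^{n} \ln \left( \sum_{i=1}^{N} p_{i,t} (1 + \eta x_{i,t}) \right)
   = \sum_{t=1}^{n} \ln \left( 1 + \eta \sum_{i=1}^{N} x_{i,t} p_{i,t} \right)
   \le \eta \widehat{X}_n \qquad (7) $$

where in the last step we used $\ln(1 + z_t) \le z_t$ for all
$z_t = \eta \sum_{i=1}^{N} x_{i,t} p_{i,t} \ge -1/2$. Combining (6) and
(7), and dividing by $\eta > 0$, we get

$$ \widehat{X}_n \ge -\frac{\ln N}{\eta} + X_{k,n} - \eta\, Q_{k,n} . $$

Our choice of $\eta$ gives the claimed bound. □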

By choosing $\eta$ appropriately, we can optimize the bound as follows.

Theorem 1. Assume there exists $M > 0$ such that the payoffs satisfy
$x_{i,t} \ge -M$ for $t = 1, \ldots, n$ and $i = 1, \ldots, N$. For any
$Q > 0$, if prod($\eta$) is run with

$$ \eta = \min \left\{ 1/(2M),\ \sqrt{(\ln N)/Q} \right\} \qquad (8) $$

then for any sequence of payoffs, for any action $k$, and for any $n \ge 1$
such that $Q_{k,n} \le Q$,

$$ \widehat{X}_n \ge X_{k,n} - \max \left\{ 2\sqrt{Q \ln N},\ 4M \ln N \right\} . $$
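The tuning (8) and the resulting guarantee can be illustrated numerically. The sketch below is an assumption-laden toy check, not an experiment from the paper: payoffs are drawn uniformly from $[-M, M]$ with a fixed seed, $Q = nM^2$ is used as a crude upper bound on every $Q_{k,n}$, and the helper names `tuned_eta` and `prod_reward` are hypothetical.

```python
import math
import random


def tuned_eta(m, q, n_actions):
    """Learning rate (8): min{1/(2M), sqrt(ln N / Q)}."""
    return min(1.0 / (2.0 * m), math.sqrt(math.log(n_actions) / q))


def prod_reward(payoffs, eta):
    """Cumulative reward of prod(eta) on payoffs[t][i]."""
    w = [1.0] * len(payoffs[0])
    reward = 0.0
    for x_t in payoffs:
        total = sum(w)
        reward += sum(w_i * x for w_i, x in zip(w, x_t)) / total
        w = [w_i * (1.0 + eta * x) for w_i, x in zip(w, x_t)]
    return reward


rng = random.Random(0)                 # fixed seed for reproducibility
n, n_actions, m = 200, 5, 1.0
payoffs = [[rng.uniform(-m, m) for _ in range(n_actions)] for _ in range(n)]

q = n * m * m                          # Q_{k,n} <= n M^2 for every action k
eta = tuned_eta(m, q, n_actions)
reward = prod_reward(payoffs, eta)

# Right-hand side of Theorem 1, evaluated for the best action.
penalty = max(2.0 * math.sqrt(q * math.log(n_actions)),
              4.0 * m * math.log(n_actions))
best = max(sum(row[k] for row in payoffs) for k in range(n_actions))
slack = reward - (best - penalty)      # nonnegative if the bound holds
```

Because the bound holds for every action $k$ with $Q_{k,n} \le Q$, checking it against the best cumulative payoff is enough.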
3.1. Unknown bound on quadratic variation (Q)

To achieve the bound stated in Theorem 1, the parameter $\eta$ must be
tuned using preliminary knowledge of a lower bound on the payoffs
and an upper bound on the quantities $Q_{k,n}$. In this and the following
sections we remove these requirements one by one. We start by
introducing a new algorithm that, using a doubling trick over prod, avoids
any preliminary knowledge of an upper bound on the $Q_{k,n}$.

Let $k^*_t$ be the index of the best action up to time $t$; that is,
$k^*_t \in \arg\max_k X_{k,t}$ (ties are broken by choosing the action $k$
with minimal associated $Q_{k,t}$). We denote the associated quadratic
penalty by

$$ Q^*_t = Q_{k^*_t,t} = \sum_{s=1}^{t} x_{k^*_t,s}^2 . $$

Ideally, our regret bound should depend on $Q^*_n$ and be of the form
$\sqrt{Q^*_n \ln N} + M \ln N$. However, note that the sequence
$Q^*_1, Q^*_2, \ldots$ is not necessarily monotone, since if at time $t+1$
the best action changes, then $Q^*_t$ and $Q^*_{t+1}$ are not related.
Therefore, we cannot use a straightforward doubling trick, as this only
applies to monotone sequences. Our solution is to express the bound in
terms of the smallest nondecreasing sequence that upper bounds the
original sequence $(Q^*_t)_{t \ge 1}$. This is a general trick
to handle situations where the penalty terms are not monotone.

Let prod-Q($M$) be the prediction algorithm that receives a quantity
$M > 0$ as input parameter and repeatedly runs prod($\eta_r$), where
$\eta_r$ is defined below. The parameter $M$ is a bound on the payoffs, such
that for all $i = 1, \ldots, N$ and $t = 1, \ldots, n$, we have
$|x_{i,t}| \le M$. The $r$-th parameter $\eta_r$ corresponds to the
parameter $\eta$ defined in (8) for $M$ and $Q = 4^r M^2$. Namely, we choose

$$ \eta_r = \min \left\{ 1/(2M),\ \sqrt{\ln N} / (2^r M) \right\} . $$

We call epoch $r$, $r = 0, 1, \ldots$, the sequence of time steps when
prod-Q is running prod($\eta_r$). The last step of epoch $r \ge 0$ is the
time step $t = t_r$ when $Q^*_t > 4^r M^2$ happens for the first time. When
a new epoch $r + 1$ begins, prod is restarted with parameter $\eta_{r+1}$.

Theorem 2. Given $M > 0$, for all $n \ge 1$ and all sequences of payoffs
bounded by $M$, i.e., $\max_{1 \le i \le N} \max_{1 \le t \le n} |x_{i,t}| \le M$,
the cumulative reward of algorithm prod-Q($M$) satisfies

$$ \widehat{X}_n \ge X^*_n - 8 \sqrt{(\ln N) \max_{s \le n} Q^*_s}
   - 2M (1 + \log_4 n) - 4M (1 + r_0) \ln N , $$

where $r_0 = 1 + \lfloor (\log_2 \ln N)/2 \rfloor$.

Proof. We denote by $R$ the index of the last epoch and let $t_R = n$.
If we have only one epoch, then the theorem follows from Theorem 1
applied with a bound of $Q = M^2$ on the squared payoffs of the best
expert. Therefore, for the rest of the proof we assume $R \ge 1$. Let

$$ X_k^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} x_{k,s} , \qquad
   Q_k^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} x_{k,s}^2 , \qquad
   \widehat{X}^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} \widehat{x}_s $$

where the sums are over all the time steps $s$ in epoch $r$ except the last
one, $t_r$. (Here $t_{-1}$ is conventionally set to 0.) We also denote by
$k_r = k^*_{t_r - 1}$ the index of the best overall expert up to time
$t_r - 1$ (one time step before the end of epoch $r$). We have that
$Q_{k_r}^{(r)} \le Q_{k_r, t_r - 1} = Q^*_{t_r - 1}$. Now,
by definition of the algorithm, $Q^*_{t_r - 1} \le 4^r M^2$. Theorem 1
(applied to time steps $t_{r-1} + 1, \ldots, t_r - 1$) shows that

$$ \widehat{X}^{(r)} \ge X_{k_r}^{(r)}
   - \max \left\{ 2\sqrt{4^r M^2 \ln N},\ 4M \ln N \right\} . $$

The maximum in the right-hand side equals $2^{r+1} M \sqrt{\ln N}$ when
$r > r_0 = 1 + \lfloor (\log_2 \ln N)/2 \rfloor$. Summing over
$r = 0, \ldots, R$ we get

$$ \widehat{X}_n = \sum_{r=0}^{R} \left( \widehat{X}^{(r)} + \widehat{x}_{t_r} \right) $$
$$ \ge \sum_{r=0}^{R} \left( X_{k_r}^{(r)} + \widehat{x}_{t_r} \right)
   - 4 (1 + r_0) M \ln N - \sum_{r=r_0+1}^{R} 2^{r+1} M \sqrt{\ln N} $$
$$ \ge \sum_{r=0}^{R} \left( X_{k_r}^{(r)} + \widehat{x}_{t_r} \right)
   - 4 (1 + r_0) M \ln N - 2^{R+2} M \sqrt{\ln N} $$
$$ \ge \sum_{r=0}^{R} X_{k_r}^{(r)} - (R + 1) M
   - 4 (1 + r_0) M \ln N - 2^{R+2} M \sqrt{\ln N} . \qquad (9) $$

Now, since $k_0$ is the index of the expert with largest payoff up to time
$t_0 - 1$, we have that
$X_{k_1, t_1 - 1} = X_{k_1}^{(0)} + x_{k_1, t_0} + X_{k_1}^{(1)}
\le X_{k_0}^{(0)} + X_{k_1}^{(1)} + M$.
By a simple induction, we in fact get

$$ X_{k_R, t_R - 1} \le \sum_{r=0}^{R-1} \left( X_{k_r}^{(r)} + M \right)
   + X_{k_R}^{(R)} . \qquad (10) $$

As, in addition, $X_{k_R, t_R - 1} = X_{k^*_{n-1}, n-1}$ and
$X_{k^*_n, n}$ may only differ by at most $M$, combining (9) and (10) we
have indeed proven that

$$ \widehat{X}_n \ge X_{k^*_n, n}
   - \left( 2 (R + 1) M + 4M (1 + r_0) \ln N + 2^{R+2} M \sqrt{\ln N} \right) . $$

The proof is concluded by noting first, that $R \le \log_4 n$, and second
that, as $R \ge 1$, $\max_{s \le n} Q^*_s \ge 4^{R-1} M^2$ by definition of
the algorithm. □
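The epoch structure of prod-Q($M$) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `prod_q` and the data layout are hypothetical, the epoch test is applied after the reward of the round has been collected (so the round on which $Q^*_t$ first exceeds $4^r M^2$ is the last one of epoch $r$), and ties for the best action are broken by the smallest $Q_{k,t}$ as in the text.

```python
import math


def prod_q(payoffs, m):
    """Run prod-Q(M) on payoffs[t][i]; return (reward, index of last epoch)."""
    n_actions = len(payoffs[0])
    log_n = math.log(n_actions)
    cum = [0.0] * n_actions            # X_{k,t}
    quad = [0.0] * n_actions           # Q_{k,t}
    r = 0                              # current epoch
    w = [1.0] * n_actions              # weights of the current run of prod
    reward = 0.0
    for x_t in payoffs:
        # eta_r is the tuning (8) with Q = 4^r M^2.
        eta = min(1.0 / (2.0 * m), math.sqrt(log_n) / (2 ** r * m))
        total = sum(w)
        reward += sum(w_i * x for w_i, x in zip(w, x_t)) / total
        for i, x in enumerate(x_t):
            cum[i] += x
            quad[i] += x * x
        # Best action so far; ties go to the smallest quadratic penalty.
        k_star = max(range(n_actions), key=lambda i: (cum[i], -quad[i]))
        if quad[k_star] > 4 ** r * m * m:
            r += 1                     # this round closes epoch r
            w = [1.0] * n_actions      # fresh start for the next epoch
        else:
            w = [w_i * (1.0 + eta * x) for w_i, x in zip(w, x_t)]
    return reward, r


reward, last_epoch = prod_q([[1.0, 0.0]] * 5, m=1.0)
```

On the constant sequence used in the example, $Q^*_t = t$, so epochs end at $t = 2$ (when $Q^*_t > 1$) and at $t = 5$ (when $Q^*_t > 4$).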

3.2. Unknown bound on payoffs (M) 

In this section we show how one can handle the case when there is
no a priori bound on the payoffs. In the next section we combine the
techniques of this section and Section 3.1 to deal with the case when
both parameters are unknown.

Let prod-M($Q$) be the prediction algorithm that receives a number
$Q > 0$ as input parameter and repeatedly runs prod($\eta_r$), where the
$\eta_r$, $r = 0, 1, \ldots$, are defined below. We call epoch $r$ the
sequence of time steps when prod-M is running prod($\eta_r$). At the
beginning, $r = 0$ and prod-M($Q$) runs prod($\eta_0$), where

$$ M_0 = \sqrt{Q / (4 \ln N)} \quad \text{and} \quad
   \eta_0 = 1/(2M_0) = \sqrt{(\ln N)/Q} . $$

For all $t \ge 1$, we denote

$$ M_t = \max_{s=1,\ldots,t}\ \max_{i=1,\ldots,N}
   2^{\lceil \log_2 |x_{i,s}| \rceil} . $$

The last step of epoch $r \ge 0$ is the time step $t = t_r$ when
$M_t > M_{t_{r-1}}$ happens for the first time (conventionally, we set
$M_{t_{-1}} = M_0$). When a new epoch $r + 1$ begins, prod is restarted
with parameter $\eta_{r+1} = 1/(2M_{t_r})$.

Note that $\eta_0 = 1/(2M_0)$ in round 0 and $\eta_r = 1/(2M_{t_{r-1}})$ in
any round $r \ge 1$, where $M_{t_0} > M_0$ and $M_{t_r} \ge 2M_{t_{r-1}}$
for each $r \ge 1$.
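The running estimates $M_t$ are easy to compute online. The helper below is a hypothetical sketch (the name `range_estimates` is not from the paper); it assumes that a zero payoff, for which $\log_2 |x|$ is undefined, simply leaves the estimate unchanged.

```python
import math


def range_estimates(payoffs):
    """Return [M_1, ..., M_n], where M_t is the largest value of
    2 ** ceil(log2 |x_{i,s}|) over all actions i and rounds s <= t."""
    estimates = []
    current = 0.0
    for x_t in payoffs:
        for x in x_t:
            if x != 0.0:               # assumption: zeros do not change M_t
                current = max(current, 2.0 ** math.ceil(math.log2(abs(x))))
        estimates.append(current)
    return estimates


estimates = range_estimates([[0.3, -0.5], [1.0, 0.0], [-3.0, 0.7], [2.5, 1.0]])
```

Rounding magnitudes up to powers of two means that each change of the estimate at least doubles it, which is what keeps the number of restarts logarithmic in the payoff range.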

Theorem 3. For any sequence of payoffs, for any action $k$, and for
any $n \ge 1$ such that $Q_{k,n} \le Q$, the cumulative reward of algorithm
prod-M($Q$) is lower bounded as

$$ \widehat{X}_n \ge X_{k,n} - 2\sqrt{Q \ln N} - 12 M (1 + \ln N) $$

where $M = \max_{1 \le i \le N} \max_{1 \le t \le n} |x_{i,t}|$.

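Proof. As in the proof of Theorem 2, we denote by $R$ the index of the
last epoch and let $t_R = n$. We assume $R \ge 1$ (otherwise, the theorem
follows directly from Theorem 1 applied with a lower bound of $-M_0$ on
the payoffs). Note that at time $n$ we have either $M_n \le M_{t_{R-1}}$,
implying $M_n = M_{t_R} = M_{t_{R-1}}$, or $M_n > M_{t_{R-1}}$, implying
$M_n = M_{t_R} \ge 2M_{t_{R-1}}$. In both cases,
$M_{t_R} \ge M_{t_{R-1}}$. In addition, since $R \ge 1$, we also have
$M_{t_R} \le 2M$.

Similarly to the proof of Theorem 2, for all epochs $r$ and actions $k$
introduce

$$ X_k^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} x_{k,s} , \qquad
   Q_k^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} x_{k,s}^2 , \qquad
   \widehat{X}^{(r)} = \sum_{s=t_{r-1}+1}^{t_r - 1} \widehat{x}_s $$

where, as before, we set $t_{-1} = 0$. Applying Lemma 2 to each epoch
$r = 0, \ldots, R$ we get that $\widehat{X}_n - X_{k,n}$ is equal to

$$ \widehat{X}_n - X_{k,n}
   = \sum_{r=0}^{R} \left( \widehat{X}^{(r)} - X_k^{(r)} \right)
   + \sum_{r=0}^{R} \left( \widehat{x}_{t_r} - x_{k,t_r} \right) $$
$$ \ge - \sum_{r=0}^{R} \frac{\ln N}{\eta_r}
   - \sum_{r=0}^{R} \eta_r Q_k^{(r)}
   + \sum_{r=0}^{R} \left( \widehat{x}_{t_r} - x_{k,t_r} \right) . $$

We bound each sum separately. For the first sum, since
$M_{t_s} \ge 2^{s-r} M_{t_r}$ for each $0 \le r \le s \le R - 1$, we have
for $s \le R - 1$,

$$ \sum_{r=0}^{s} M_{t_r} \le \sum_{r=0}^{s} 2^{r-s} M_{t_s} \le 2 M_{t_s} . \qquad (11) $$

Thus,

$$ \sum_{r=0}^{R} \frac{\ln N}{\eta_r}
   = \sum_{r=0}^{R} 2 M_{t_{r-1}} \ln N
   \le 2 \left( M_{t_{-1}} + 2 M_{t_{R-1}} \right) \ln N
   \le 6 M_{t_R} \ln N $$

where we used (11) and $M_{t_{-1}} = M_0 \le M_{t_{R-1}} \le M_{t_R}$. For
the second sum, using the fact that $\eta_r$ decreases with $r$, we have

$$ \sum_{r=0}^{R} \eta_r Q_k^{(r)} \le \eta_0 \sum_{r=0}^{R} Q_k^{(r)}
   \le \eta_0 Q_{k,n} \le \sqrt{\frac{\ln N}{Q}}\, Q = \sqrt{Q \ln N} . $$

Finally, using (11) again,

$$ \sum_{r=0}^{R} \left| \widehat{x}_{t_r} - x_{k,t_r} \right|
   \le \sum_{r=0}^{R} 2 M_{t_r}
   \le 2 \left( 2 M_{t_{R-1}} + M_{t_R} \right) \le 6 M_{t_R} . $$

The resulting lower bound $-6 M_{t_R} (1 + \ln N) - \sqrt{Q \ln N}$ on
$\widehat{X}_n - X_{k,n}$ implies the one stated in the theorem by
recalling that, when $R \ge 1$, $M_{t_R} \le 2M$. □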
3.3. Unknown bounds on both payoffs (M) and quadratic variation (Q)

We now show a regret bound for the case when $M$ and the $Q_{k,n}$ are
both unknown. We consider again the notation of the beginning of
Section 3.1. The quantities of interest for the doubling trick of
Section 3.1 were the homogeneous quantities $(1/M^2) \max_{s \le t} Q^*_s$.
Here we assume no knowledge of $M$. We propose a doubling trick on the only
homogeneous quantities we have access to, that is,
$\max_{s \le t} (Q^*_s / M_s^2)$, where $M_t$ is defined in Section 3.2 and
the maximum is needed for the same reasons of monotonicity as in
Section 3.1.

We define the new (parameterless) prediction algorithm prod-MQ.
Intuitively, the algorithm can be thought of as running, at the low level,
the algorithm prod-Q($M_t$). When the value of $M_t$ changes, we restart
prod-Q($M_t$) with the new value but keep track of $Q^*_t$.

Formally, we define the prediction algorithm prod-MQ in the following
way. Epochs are indexed by pairs $(r, s)$. At the beginning of each
epoch $(r, s)$, the algorithm takes a fresh start and runs
prod($\eta_{(r,s)}$), where $\eta_{(r,s)}$, for $r = 0, 1, \ldots$ and
$s = 0, 1, \ldots$, is defined by

$$ \eta_{(r,s)} = \min \left\{ 1/(2M^{(r)}),\
   \sqrt{\ln N} / \left( 2^{S_{r-1}+s} M^{(r)} \right) \right\} $$

and $M^{(r)}$, $S_r$ are defined below.

At the beginning, $r = 0$, $s = 0$, and since prod($\eta$) always sets $p_1$
to be the uniform distribution irrespective of the choice of $\eta$, without
loss of generality we assume that prod is started at epoch $(0, 0)$ with
$M^{(0)} = M_1$ and $S_{-1} = 0$.

The last step of epoch $(r, s)$ is the time step $t = t_{(r,s)}$ when either:

(C1) $Q^*_t > 4^{S_{r-1}+s} M_t^2$ happens for the first time

or

(C2) $M_t > M^{(r)}$ happens for the first time.

If epoch $(r, s)$ ends because of (C1), the next epoch is $(r, s + 1)$, and
the value of $M^{(r)}$ is unchanged. If epoch $(r, s)$ ends because of (C2),
the next epoch is $(r + 1, 0)$, $S_r = S_{r-1} + s$, and $M^{(r+1)} = M_t$.

Note that within epochs indexed by the same $r$, the payoffs in all
steps but the last one are bounded by $M^{(r)}$. Note also that the
quantities $S_r$ count the number of times an epoch ended because of (C1).
Finally, note that there are $S_r - S_{r-1} + 1$ epochs $(r, s)$ for a given
$r \ge 0$, indexed by $s = 0, \ldots, S_r - S_{r-1}$.
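The bookkeeping of the double-indexed epochs can be sketched with two small helpers. This is only an illustration: the names `eta_rs` and `next_epoch` are hypothetical, and the order of the two tests is an assumption, since the text does not say which rule applies when (C1) and (C2) hold at the same time step; here (C2) is checked first.

```python
import math


def eta_rs(m_r, s_prev, s, n_actions):
    """Learning rate of epoch (r, s), given M^(r) and S_{r-1}."""
    return min(1.0 / (2.0 * m_r),
               math.sqrt(math.log(n_actions)) / (2 ** (s_prev + s) * m_r))


def next_epoch(r, s, s_prev, m_r, q_star_t, m_t):
    """Return the state (r, s, S_{r-1}, M^(r)) in force after time step t.

    q_star_t is Q*_t and m_t is M_t, both computed from payoffs up to t.
    """
    if m_t > m_r:                                   # (C2): range estimate grew
        return r + 1, 0, s_prev + s, m_t            # S_r = S_{r-1} + s
    if q_star_t > 4 ** (s_prev + s) * m_t ** 2:     # (C1): quadratic term grew
        return r, s + 1, s_prev, m_r
    return r, s, s_prev, m_r                        # epoch continues


state_c1 = next_epoch(0, 0, 0, 1.0, q_star_t=2.0, m_t=1.0)
state_c2 = next_epoch(0, 1, 0, 1.0, q_star_t=2.0, m_t=2.0)
rate = eta_rs(2.0, 1, 0, 2)
```

Whenever the returned pair $(r, s)$ differs from the current one, the weights of prod would be reset to 1 and the run continued with the new learning rate.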
Theorem 4. For any sequence of payoffs and for any $n \ge 1$, the
cumulative reward of algorithm prod-MQ satisfies

$$ \widehat{X}_n \ge X^*_n - 32 M \sqrt{q \ln N}
   - 22 M (1 + \ln N) - 2M \log_2 n - 4M \lceil (\log_2 \ln N)/2 \rceil $$
$$ = X^*_n - O\left( M \sqrt{q \ln N} + M \ln n + M \ln N \right) $$

where $M = \max_{1 \le i \le N} \max_{1 \le t \le n} |x_{i,t}|$ and
$q = \max \left\{ 1,\ \max_{s \le n} Q^*_s / M_s^2 \right\}$.

The proof is in the Appendix.



4. Second-order bounds for weighted majority 

In this section we derive new regret bounds for the weighted majority
forecaster of Littlestone and Warmuth (1994) using a time-varying
learning rate. This allows us to avoid the doubling tricks of Section 3
and keep the assumption that no knowledge on the payoff sequence is
available to the forecaster beforehand.

Similarly to the results of Section 3, the main term in the new
bounds depends on second-order quantities associated with the sequence
of payoffs. However, the precise definition of these quantities makes the
bounds of this section generally not comparable to the bounds obtained
in Section 3.

The weighted majority forecaster using the sequence
$\eta_2, \eta_3, \ldots > 0$ of learning rates assigns at time $t$ a
probability distribution $p_t$ over the $N$ experts defined by
$p_1 = (1/N, \ldots, 1/N)$ and

$$ p_{i,t} = \frac{e^{\eta_t X_{i,t-1}}}{\sum_{j=1}^{N} e^{\eta_t X_{j,t-1}}}
   \quad \text{for } i = 1, \ldots, N \text{ and } t \ge 2 . \qquad (12) $$

Note that the quantities $\eta_t > 0$ may depend on the past payoffs
$x_{i,s}$, $i = 1, \ldots, N$ and $s = 1, \ldots, t - 1$. The analysis of
Auer, Cesa-Bianchi, and Gentile (2002), for a related variant of weighted
majority, is at the core of the proof of the following lemma (proof in
Appendix).
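The assignment (12) can be computed as in the following sketch. The function name `weighted_majority_probs` is hypothetical; subtracting the largest cumulative payoff before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged and is not part of the definition.

```python
import math


def weighted_majority_probs(cum_payoffs, eta_t):
    """Probabilities (12) from the cumulative payoffs X_{i,t-1}."""
    shift = max(cum_payoffs)           # avoids overflow; cancels in the ratio
    weights = [math.exp(eta_t * (x - shift)) for x in cum_payoffs]
    total = sum(weights)
    return [w / total for w in weights]


p = weighted_majority_probs([0.0, math.log(2.0)], eta_t=1.0)
```

Because the weights are recomputed from the cumulative payoffs at every round, changing $\eta_t$ over time requires no restart, in contrast with the doubling tricks of Section 3.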

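Lemma 3. Consider any nonincreasing sequence $\eta_2, \eta_3, \ldots$ of positive
learning rates and any sequence $x_1, x_2, \ldots \in \mathbb{R}^N$ of payoff vectors.
Define the nonnegative function $\Phi$ by

$$\Phi(p_t, \eta_t, x_t) = -\sum_{i=1}^N p_{i,t} x_{i,t} + \frac{1}{\eta_t} \ln\left( \sum_{i=1}^N p_{i,t} e^{\eta_t x_{i,t}} \right).$$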





Then the weighted majority forecaster (12) run with the sequence $\eta_2$,
$\eta_3, \ldots$ satisfies, for any $n \ge 1$ and for any $\eta_1 \ge \eta_2$,

$$\hat X_n - X_n^* \ge -\left( \frac{2}{\eta_{n+1}} - \frac{1}{\eta_1} \right) \ln N - \sum_{t=1}^n \Phi(p_t, \eta_t, x_t).$$

Let $Z_t$ be the random variable with range $\{x_{1,t}, \ldots, x_{N,t}\}$ and distribution
$p_t$. Note that $\mathbb{E} Z_t$ is the expected payoff $\hat x_t$ of the forecaster using
distribution $p_t$ at time $t$. Introduce

$$\mathrm{Var}\, Z_t = \mathbb{E} Z_t^2 - \mathbb{E}^2 Z_t = \sum_{i=1}^N p_{i,t} x_{i,t}^2 - \left( \sum_{i=1}^N p_{i,t} x_{i,t} \right)^2.$$

Hence $\mathrm{Var}\, Z_t$ is the variance of the payoffs at time $t$ under the distribution
$p_t$ and the cumulative variance $V_n = \mathrm{Var}\, Z_1 + \cdots + \mathrm{Var}\, Z_n$ is
the main second-order quantity used in this section. The next result
bounds $\Phi(p_t, \eta_t, x_t)$ in terms of $\mathrm{Var}\, Z_t$.

Lemma 4. For all payoff vectors $x_t = (x_{1,t}, \ldots, x_{N,t})$, all probability
distributions $p_t = (p_{1,t}, \ldots, p_{N,t})$, and all learning rates $\eta_t > 0$, we have

$$\Phi(p_t, \eta_t, x_t) \le E$$

where $E$ is such that $|x_{i,t} - x_{j,t}| \le E$ for all $i, j = 1, \ldots, N$. If, in
addition, $0 \le \eta_t |x_{i,t} - x_{j,t}| \le 1$ for all $i, j = 1, \ldots, N$, then

$$\Phi(p_t, \eta_t, x_t) \le (e - 2)\, \eta_t \mathrm{Var}\, Z_t.$$

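Proof. The first inequality is straightforward. To prove the second
one we use $e^a \le 1 + a + (e - 2) a^2$ for $|a| \le 1$. Consequently, noting that
$\eta_t |x_{i,t} - \hat x_t| \le 1$ for all $i$ by assumption, we have that

$$\Phi(p_t, \eta_t, x_t) = \frac{1}{\eta_t} \ln\left( \sum_{i=1}^N p_{i,t} e^{\eta_t (x_{i,t} - \hat x_t)} \right) \le \frac{1}{\eta_t} \ln\left( \sum_{i=1}^N p_{i,t} \left( 1 + \eta_t (x_{i,t} - \hat x_t) + (e - 2)\, \eta_t^2 (x_{i,t} - \hat x_t)^2 \right) \right).$$

Using $\ln(1 + a) \le a$ for all $a > -1$ and some simple algebra concludes
the proof of the second inequality. □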

In Auer et al. (2002) a very similar result is proven, except that there
the variance is further bounded (up to a multiplicative factor) by the
expectation $\hat x_t$ of $Z_t$.
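The two inequalities of Lemma 4 can be checked numerically. The sketch below is only a sanity check on randomly drawn instances, not a proof; the helper names (`phi`, `variance`, `lemma4_holds`), the random seed, and the tolerance are assumptions made for this example. The function $\Phi$ is evaluated in its centred form, which is algebraically identical to the definition in Lemma 3 and avoids overflow.

```python
import math
import random

def phi(p, eta, x):
    """Phi(p, eta, x) = -sum_i p_i x_i + (1/eta) ln sum_i p_i exp(eta x_i),
    computed as (1/eta) ln sum_i p_i exp(eta (x_i - mean))."""
    mean = sum(pi * xi for pi, xi in zip(p, x))
    inner = sum(pi * math.exp(eta * (xi - mean)) for pi, xi in zip(p, x))
    return math.log(inner) / eta

def variance(p, x):
    """Variance of the payoffs x under the distribution p."""
    mean = sum(pi * xi for pi, xi in zip(p, x))
    return sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, x))

def lemma4_holds(trials=2000, seed=0, tol=1e-12):
    """Check 0 <= Phi <= E and Phi <= (e - 2) eta Var Z when eta * E <= 1."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(2, 6)
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        raw = [rng.random() + 1e-9 for _ in range(n)]
        p = [r / sum(raw) for r in raw]
        E = max(x) - min(x)
        if E == 0.0:
            continue  # degenerate draw, nothing to check
        eta = rng.uniform(0.01, 1.0) / E  # guarantees eta * E <= 1
        value = phi(p, eta, x)
        if not (-tol <= value <= E + tol):
            return False
        if value > (math.e - 2) * eta * variance(p, x) + tol:
            return False
    return True
```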
4.1. Known bound on the payoff ranges (E)

We now introduce a time-varying learning rate based on $V_n$. For simplicity,
we first assume that a bound $E$ on the payoff ranges
$E_t$, defined in (1), is known beforehand, and return to the general
case in Theorem 6. The sequence $\eta_2, \eta_3, \ldots$ is defined as

$$\eta_t = \min\left\{ \frac{1}{E},\ C \sqrt{\frac{\ln N}{V_{t-1}}} \right\} \qquad (13)$$

for $t \ge 2$, with $C = \sqrt{2 (\sqrt{2} - 1)/(e - 2)} \approx 1.07$.

Note that $\eta_t$ depends on the forecaster's past predictions. This is in
the same spirit as the self-confident learning rates considered in Auer,
Cesa-Bianchi, and Gentile (2002).
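A minimal Python sketch of the learning rate (13) follows. The function name `eta_known_range` is hypothetical, and the treatment of $V_{t-1} = 0$ (where the second term of the minimum is read as infinite, so that the rate equals $1/E$) is an assumption of this example.

```python
import math

# Constant C from (13): sqrt(2 (sqrt(2) - 1) / (e - 2)).
C = math.sqrt(2 * (math.sqrt(2) - 1) / (math.e - 2))

def eta_known_range(E, v_prev, n_experts):
    """Rate (13): min{1/E, C * sqrt(ln N / V_{t-1})} for a known range bound E."""
    cap = 1.0 / E
    if v_prev <= 0.0:
        # Assumed convention: with zero cumulative variance the second term
        # is infinite, so the minimum is 1/E.
        return cap
    return min(cap, C * math.sqrt(math.log(n_experts) / v_prev))
```

The rate is nonincreasing in $V_{t-1}$, which is the monotonicity required by Lemma 3.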

Theorem 5. Provided a bound $E$ on the payoff ranges is known beforehand,
i.e., $\max_{t=1,\ldots,n} \max_{i,j=1,\ldots,N} |x_{i,t} - x_{j,t}| \le E$, the weighted
majority forecaster using the time-varying learning rate (13) achieves,
for all sequences of payoffs and for all $n \ge 1$,

$$\hat X_n - X_n^* \ge -4 \sqrt{V_n \ln N} - 2 E \ln N - E/2.$$

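Proof. We start by applying Lemma 3 using the learning rate (13),
and setting $\eta_1 = \eta_2$ for the analysis,

$$\hat X_n - X_n^* \ge -\left( \frac{2}{\eta_{n+1}} - \frac{1}{\eta_1} \right) \ln N - \sum_{t=1}^n \Phi(p_t, \eta_t, x_t)$$
$$\ge -2 \max\left\{ E \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - (e - 2) \sum_{t=1}^n \eta_t \mathrm{Var}\, Z_t$$

where $C$ is defined in (13) and the second inequality follows from the
second bound of Lemma 4. We now denote by $T$ the first time step $t$
when $V_t > E^2/4$. Using that $\eta_t \le 1/E$ for all $t$ and $V_T \le E^2/2$, we get

$$\sum_{t=1}^n \eta_t \mathrm{Var}\, Z_t \le \frac{E}{2} + \sum_{t=T+1}^n \eta_t \mathrm{Var}\, Z_t. \qquad (14)$$

We bound the last sum using $\eta_t \le C \sqrt{(\ln N)/V_{t-1}}$ for $t \ge T + 1$ (note
that, for $t \ge T + 1$, $V_{t-1} \ge V_T > E^2/4 > 0$). This yields

$$\sum_{t=T+1}^n \eta_t \mathrm{Var}\, Z_t \le C \sqrt{\ln N} \sum_{t=T+1}^n \frac{V_t - V_{t-1}}{\sqrt{V_{t-1}}}.$$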
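Since $V_t \le V_{t-1} + E^2/4$ and $V_{t-1} \ge E^2/4$ for $t \ge T + 1$, we have

$$\frac{V_t - V_{t-1}}{\sqrt{V_{t-1}}} = \frac{\sqrt{V_t} + \sqrt{V_{t-1}}}{\sqrt{V_{t-1}}} \left( \sqrt{V_t} - \sqrt{V_{t-1}} \right) \le (\sqrt{2} + 1) \left( \sqrt{V_t} - \sqrt{V_{t-1}} \right) = \frac{\sqrt{V_t} - \sqrt{V_{t-1}}}{\sqrt{2} - 1}.$$

Therefore, by a telescoping argument,

$$\sum_{t=T+1}^n \eta_t \mathrm{Var}\, Z_t \le \frac{C \sqrt{\ln N}}{\sqrt{2} - 1} \left( \sqrt{V_n} - \sqrt{V_T} \right) \le \frac{C}{\sqrt{2} - 1} \sqrt{V_n \ln N}. \qquad (15)$$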

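Putting things together, we have already proved that

$$\hat X_n - X_n^* \ge -2 \max\left\{ E \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - \frac{e - 2}{2} E - \frac{(e - 2) C}{\sqrt{2} - 1} \sqrt{V_n \ln N}.$$

In the case when $\sqrt{V_n} \ge C E \sqrt{\ln N}$, the regret $\hat X_n - X_n^*$ is bounded
from below by

$$-\left( \frac{2}{C} + \frac{C (e - 2)}{\sqrt{2} - 1} \right) \sqrt{V_n \ln N} - \frac{e - 2}{2} E \ge -4 \sqrt{V_n \ln N} - E/2,$$

where we substituted the value of $C$ and obtained a constant for the
leading term equal to $2 \sqrt{2 (e - 2)} / \sqrt{\sqrt{2} - 1} \le 3.75$. When $\sqrt{V_n} \le
C E \sqrt{\ln N}$, the lower bound is more than

$$-2 E \ln N - \frac{C (e - 2)}{\sqrt{2} - 1} \sqrt{V_n \ln N} - \frac{e - 2}{2} E \ge -2 E \ln N - 2 \sqrt{V_n \ln N} - E/2.$$

This concludes the proof. □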
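To make the statement of Theorem 5 concrete, the following Python sketch runs the weighted majority forecaster (12) with the learning rate (13) on a synthetic payoff sequence and compares $\hat X_n - X_n^*$ with the lower bound of the theorem. Everything about the experiment (uniform random payoffs in $[-1, 1]$, the seed, the function names) is an assumption made for illustration; a single run is of course no substitute for the proof.

```python
import math
import random

C = math.sqrt(2 * (math.sqrt(2) - 1) / (math.e - 2))

def run_known_range(payoffs, E):
    """payoffs[t][i] is the payoff of action i at round t+1.
    Returns (forecaster's cumulative reward, best action's reward, V_n)."""
    n_actions = len(payoffs[0])
    cum = [0.0] * n_actions   # cumulative payoffs X_{i,t-1}
    reward = 0.0              # cumulative expected reward of the forecaster
    V = 0.0                   # cumulative variance V_{t-1}
    for x in payoffs:
        # Learning rate (13); at t = 1 the weights are uniform for any rate.
        if V <= 0.0:
            eta = 1.0 / E
        else:
            eta = min(1.0 / E, C * math.sqrt(math.log(n_actions) / V))
        top = max(cum)
        w = [math.exp(eta * (c - top)) for c in cum]
        total = sum(w)
        p = [wi / total for wi in w]
        mean = sum(pi * xi for pi, xi in zip(p, x))
        V += sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, x))
        reward += mean
        cum = [c + xi for c, xi in zip(cum, x)]
    return reward, max(cum), V

def theorem5_bound(V, E, n_actions):
    """Right-hand side of Theorem 5."""
    log_n = math.log(n_actions)
    return -4 * math.sqrt(V * log_n) - 2 * E * log_n - E / 2

rng = random.Random(1)
payoffs = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(500)]
reward, best, V = run_known_range(payoffs, E=2.0)  # ranges are at most 2
gap = reward - best
```

On any sequence whose ranges are bounded by the supplied `E`, `gap` should be at least `theorem5_bound(V, E, N)`.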



4.2. Unknown bound on the payoff ranges (E) 

We present the adaptation needed when no bound on the real-valued
payoff range is known beforehand. For any sequence of payoff vectors
$x_1, x_2, \ldots$ and for all $t = 1, 2, \ldots$, we define, similarly to Section 3.2,
a quantity that keeps track of the payoff ranges seen so far. More
precisely, $E_t = 2^k$, where $k \in \mathbb{Z}$ is the smallest integer such that
$\max_{s=1,\ldots,t} \max_{i,j=1,\ldots,N} |x_{i,s} - x_{j,s}| \le 2^k$. Now let the sequence $\eta_2, \eta_3, \ldots$
be defined as

$$\eta_t = \min\left\{ \frac{1}{E_{t-1}},\ C \sqrt{\frac{\ln N}{V_{t-1}}} \right\} \qquad (16)$$

for $t \ge 2$, with $C = \sqrt{2 (\sqrt{2} - 1)/(e - 2)}$.

We are now ready to state and prove the main result of this section, 
which bounds the regret in terms of the variance of the predictions. We 
show in the next section how this bound leads to more intrinsic bounds 
on the regret. 
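The dyadic range estimate $E_t$ and a learning rate of the form $\min\{1/E_{t-1},\ C\sqrt{(\ln N)/V_{t-1}}\}$ can be sketched in Python as follows. The function names are hypothetical, and the handling of the degenerate cases (no spread observed yet, or zero cumulative variance) is an assumption of this example rather than something specified in the text.

```python
import math

C = math.sqrt(2 * (math.sqrt(2) - 1) / (math.e - 2))

def dyadic_range(max_range):
    """Smallest power of two 2^k (k an integer, possibly negative)
    such that max_range <= 2^k."""
    if max_range <= 0.0:
        return 0.0  # assumed convention: no spread observed so far
    # max_range = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(max_range)
    if mantissa == 0.5:
        return math.ldexp(1.0, exponent - 1)  # max_range is itself a power of two
    return math.ldexp(1.0, exponent)

def eta_unknown_range(E_prev, v_prev, n_experts):
    """min{1/E_{t-1}, C * sqrt(ln N / V_{t-1})}; a zero denominator is read
    as an infinite term (assumed convention)."""
    first = math.inf if E_prev <= 0.0 else 1.0 / E_prev
    second = math.inf if v_prev <= 0.0 else C * math.sqrt(math.log(n_experts) / v_prev)
    return min(first, second)
```

If both terms are infinite, all past payoff vectors were constant across actions, so the weights are uniform whatever the rate; a caller would need to treat that case separately.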

Theorem 6. Consider the weighted majority forecaster using the time-varying
learning rate (16). Then, for all sequences of payoffs and for all
$n \ge 1$,

$$\hat X_n - X_n^* \ge -4 \sqrt{V_n \ln N} - 4 E \ln N - 6 E$$

where $E = \max_{t=1,\ldots,n} \max_{i,j=1,\ldots,N} |x_{i,t} - x_{j,t}|$.

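Proof. The proof is similar to that of Theorem 5; we only have to
deal with the estimation of the payoff ranges. We apply Lemma 3 again,

$$\hat X_n - X_n^* \ge -\left( \frac{2}{\eta_{n+1}} - \frac{1}{\eta_1} \right) \ln N - \sum_{t=1}^n \Phi(p_t, \eta_t, x_t)$$
$$\ge -2 \max\left\{ E_n \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - \sum_{t=1}^n \Phi(p_t, \eta_t, x_t)$$
$$= -2 \max\left\{ E_n \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - \sum_{t \in \mathcal{T}} \Phi(p_t, \eta_t, x_t) - \sum_{t \notin \mathcal{T}} \Phi(p_t, \eta_t, x_t)$$

where $C$ is defined in (16), and $\mathcal{T}$ is the set of time steps $t \ge 2$ when
$E_t = E_{t-1}$ (note that $1 \notin \mathcal{T}$ by definition). Thus $\mathcal{T}$ is a finite union of
intervals of integers, $\mathcal{T} = [1, n] \setminus \{t_1, \ldots, t_R\}$, where we denote $t_1 = 1$
and let $t_2, \ldots, t_R$ be the time rounds $t \ge 2$ such that $E_t \ne E_{t-1}$.

Using the second bound of Lemma 4 on $t \in \mathcal{T}$ (since, for $t \in \mathcal{T}$,
$\eta_t E_t \le E_t / E_{t-1} = 1$) and the first bound of Lemma 4 on $t \notin \mathcal{T}$, which
in this case reads $\Phi(p_t, \eta_t, x_t) \le E_t$, we get

$$\hat X_n - X_n^* \ge -2 \max\left\{ E_n \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - (e - 2) \sum_{t \in \mathcal{T}} \eta_t \mathrm{Var}\, Z_t - \sum_{t \notin \mathcal{T}} E_t. \qquad (17)$$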

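We consider the $r$-th regime, $r = 1, \ldots, R$, that is, the time steps $s$
between $t_r + 1$ and $t_{r+1} - 1$ (with $t_{R+1} = n$ by convention whenever $t_R <
n$). For all these time steps $s$, $E_s = E_{t_r}$. We use the same arguments
that led to (14) and (15): denote by $T_r$ the first time step $s \ge t_r + 1$
when $V_s > E_{t_r}^2/4$. Then,

$$\sum_{s=t_r+1}^{t_{r+1}-1} \eta_s \mathrm{Var}\, Z_s \le \frac{E_{t_r}}{2} + \frac{C \sqrt{\ln N}}{\sqrt{2} - 1} \left( \sqrt{V_{t_{r+1}-1}} - \sqrt{V_{T_r}} \right).$$

Summing over $r = 1, \ldots, R$ and noting that a telescoping argument is
given by $V_{t_r} \le V_{T_r}$,

$$\sum_{t \in \mathcal{T}} \eta_t \mathrm{Var}\, Z_t \le \frac{C}{\sqrt{2} - 1} \sqrt{V_n \ln N} + \frac{1}{2} \sum_{r=1}^R E_{t_r}.$$

We deal with the last sum (also present in (17)) by noting that

$$\sum_{t \notin \mathcal{T}} E_t = \sum_{r=1}^R E_{t_r} \le \sum_{r=-\infty}^{\lceil \log_2 E \rceil} 2^r \le 2^{1 + \lceil \log_2 E \rceil} \le 4 E.$$

Putting things together,

$$\hat X_n - X_n^* \ge -2 \max\left\{ E_n \ln N,\ (1/C) \sqrt{V_n \ln N} \right\} - \frac{(e - 2) C \sqrt{\ln N}}{\sqrt{2} - 1} \sqrt{V_n} - 2 e E.$$

The proof is concluded, as the previous one, by noting that $E_n \le 2E$. □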



4.3. Randomized prediction and actual regret 

In this paper, the focus is on improved bounds for the expected regret.
After choosing a probability distribution $p_t$ on the actions, the
forecaster gets $\hat x_t = x_{1,t} p_{1,t} + \cdots + x_{N,t} p_{N,t}$ as a reward. In case randomized
prediction is considered, after choosing $p_t$, the forecaster draws
an action $I_t$ at random according to $p_t$ and gets the reward $x_{I_t,t}$, whose
conditional expectation is $\hat x_t$. In this version of the game of prediction,
the aim is now to minimize the (actual) regret, defined as the difference
between $x_{I_1,1} + \cdots + x_{I_n,n}$ and $X_n^*$.

Bernstein's inequality for martingales (see, e.g., Freedman, 1975)
shows however that the actual regret of any forecaster is bounded by
the expected regret with probability $1 - \delta$ up to deviations of the
order of $\sqrt{V_n \ln(n/\delta)} + M \ln(n/\delta)$. These deviations are of the same
order of magnitude as the bound of Theorem 6. Unless we are able
to apply a sharper concentration result than Bernstein's inequality, no
further refinement of the above bounds is worthwhile. In particular, in
view of the deviations from the expectations, as far as actual regret is
concerned, we may prefer the results of Section 4 to those of Section 3.
The next section, as well as Section 5, explains how bounds in terms of
$\sqrt{V_n}$ lead to many interesting bounds on the regret that do not depend
on quantities related to the forecaster's rewards.
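The difference between the expected reward $\hat x_t$ and the reward of a randomly drawn action can be illustrated with a short Python sketch. The distribution, the payoffs, the seed, and the function names below are hypothetical values chosen for the example.

```python
import random

def expected_reward(p, x):
    """Expected reward sum_i p_i x_i of a forecaster playing distribution p."""
    return sum(pi * xi for pi, xi in zip(p, x))

def draw_reward(p, x, rng):
    """Draw an action I according to p and return (I, x_I)."""
    action = rng.choices(range(len(p)), weights=p, k=1)[0]
    return action, x[action]

rng = random.Random(0)
p = [0.2, 0.5, 0.3]
x = [1.0, -1.0, 0.5]
# Averaging many independent draws approximates the expected reward.
samples = [draw_reward(p, x, rng)[1] for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

In a single round the drawn reward can differ from the expected reward by as much as the payoff range; this is the source of the deviation terms discussed above.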

4.4. Bounds on the forecaster's cumulative variance 

In this section we show a first way to deal with the dependency of
the bound on $V_n$, the forecaster's cumulative variance. Section 5 will
illustrate this further.

Recall that $Z_t$ is the random variable which takes the value $x_{i,t}$ with
probability $p_{i,t}$, for $i = 1, \ldots, N$. The main term of the bound stated
in Theorem 6 contains $V_n = \mathrm{Var}\, Z_1 + \cdots + \mathrm{Var}\, Z_n$. Note that $V_n$ is
therefore smaller than all quantities of the form

$$\sum_{t=1}^n \sum_{i=1}^N p_{i,t} (x_{i,t} - \mu_t)^2$$

where $(\mu_t)_{t \ge 1}$ is any sequence of real numbers which may be chosen in
hindsight, as it is not required for the definition of the forecaster. (The
minimal value of the expression is obtained for $\mu_t = \hat x_t$.) This gives us
a whole family of upper bounds, and we may choose for the analysis
the most convenient sequence of $\mu_t$.

To provide a concrete example, recall the definition (1) of payoff
effective range $E_t$ and consider the choice $\mu_t = \min_{j=1,\ldots,N} x_{j,t} + E_t/2$.
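A small numerical illustration of this argument, under assumed values for $p_t$ and $x_t$: the variance is the second moment about the mean, any other centre gives a larger value, and centring at the midpoint of the range gives at most $E_t^2/4$. The function name is hypothetical.

```python
def second_moment_about(p, x, mu):
    """sum_i p_i (x_i - mu)^2 for a distribution p over payoffs x."""
    return sum(pi * (xi - mu) ** 2 for pi, xi in zip(p, x))

# Hypothetical single-round data.
p = [0.1, 0.6, 0.3]
x = [-2.0, 1.0, 4.0]

mean = sum(pi * xi for pi, xi in zip(p, x))
var = second_moment_about(p, x, mean)   # Var Z_t, the minimal value
E_t = max(x) - min(x)                   # effective range at this round
midpoint = min(x) + E_t / 2             # the choice of mu_t made in the text
```

Here the variance equals 3.24, the second moment about the midpoint equals 3.6, and $E_t^2/4 = 9$, so each step of the chain holds with room to spare.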

Corollary 1. The regret of the weighted majority forecaster with variable
learning rate (16) satisfies

$$\hat X_n - X_n^* \ge -2 \sqrt{(\ln N) \sum_{t=1}^n E_t^2} - 4 E \ln N - 6 E$$

where $E$ is a bound on the payoff ranges, $E = \max_{t=1,\ldots,n} E_t$.

The bound proposed by Corollary 1 shows that for an effective range
of $E$, say if the payoffs all fall in $[0, E]$, the regret is lower bounded by a
quantity equal to $-2 E \sqrt{n \ln N}$ (a closer look at the proof of Theorem 6
shows that this constant factor is less than 1.9, and could be made
as close to $2 \sqrt{e - 2} = \sqrt{2} \sqrt{2 (e - 2)}$ as desired). The best leading
constant for such bounds is, to our knowledge, $\sqrt{2}$ (see Cesa-Bianchi
and Lugosi, 2006). This shows that the improved dependence in the
bound does not come at a significant increase in the magnitude of the
leading coefficient. When the actual ranges are small, these bounds give
a considerable advantage. Such a situation arises, for instance, in the
setting of on-line portfolio selection, when we use linear upper bounds
on the regrets (see, e.g., the EG strategy by Helmbold et al., 1998).
Moreover, we note that Corollary 1 improves on a result of Allenberg-Neeman
and Neeman (2004), who show a regret bound, in terms of the
cumulative effective range, whose main term is $5.7 \sqrt{2 M (\ln N) \sum_{t=1}^n E_t}$,
for a given bound $M$ over the payoffs.

Finally, we note that using translations of payoffs for prod-type 
algorithms, as suggested by Section 5.1, may be worthwhile as well, 
see Corollary 4 below. However, unlike the approach presented here 
for the weighted majority based forecaster, there the payoffs have to 
be translated explicitly and on-line by the forecaster, and thus, each 
translation rule corresponds to a different forecaster. 

4.5. Extension to problems with incomplete information 

An interesting issue is how the second-order bounds of this section ex- 
tend to incomplete information problems. In the literature of this area, 
exponentially weighted averages of estimated cumulative payoffs play a 
key role (see, for instance, Auer et al., 2002 for the multiarmed bandit 
problem, Cesa-Bianchi, Lugosi, and Stoltz, 2005 for label-efficient pre- 
diction, and Piccolboni and Schindelhauer, 2001, Cesa-Bianchi, Lugosi, 
and Stoltz, 2004 for prediction under partial monitoring). 

A careful analysis of the proofs therein shows that the order of 
magnitude of the bound on the regret is given by the root of the sum 
of the conditional variances of the estimates of the payoffs used for 
prediction, 



Here we denote by $\tilde x_{i,t}$ the (unbiased) estimate available for $x_{i,t}$ (whose
form varies depending on the precise setup and the considered strategy),
by $p_t = (p_{1,t}, \ldots, p_{N,t})$ the probability distributions over the
actions, and by $\mathbb{E}_t$ the conditional expectation with respect to the
information available up to round $t$ (for instance, in multiarmed bandit
problems, this information is the past payoffs). Note that the conditioning
in $\mathbb{E}_t$ determines the values of the payoffs $x_t = (x_{1,t}, \ldots, x_{N,t})$
and of $p_t$.

In setups with full monitoring, that is, for the setups considered in
this paper, no estimation is needed, $\tilde x_{i,t} = x_{i,t}$, and the bound is exactly
that of Theorem 6.
In multiarmed bandit problems (with payoffs in, say, $[-M, M]$), the
estimators are given by $\tilde x_{i,t} = (x_{i,t}/p_{i,t})\, \mathbb{I}_{[I_t = i]}$ where $I_t$ is the index of
the chosen component of the payoff vector. Now,

$$\mathbb{E}_t\left[ p_{i,t}\, \tilde x_{i,t}^2 \right] = x_{i,t}^2 \le M^2. \qquad (18)$$

Summing over $i = 1, \ldots, N$ and $t = 1, \ldots, n$ the bound $M \sqrt{n N \ln N}$ of
Auer et al. (2002) is recovered.
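The identity behind this computation can be verified exactly for the importance-weighted estimator, since the indicator takes only two values. The sketch below computes the conditional mean of the estimate and the conditional expectation of $p_{i,t}$ times its square; the function name and the numerical values are assumptions for the example.

```python
def bandit_moments(p, x, i):
    """Exact conditional moments of the estimate (x_i / p_i) * 1[I = i]
    when the action I is drawn according to p.
    Returns (E[estimate], E[p_i * estimate^2])."""
    est_if_chosen = x[i] / p[i]        # value of the estimate when I = i
    mean = p[i] * est_if_chosen        # the estimate is 0 otherwise
    weighted_second = p[i] * (p[i] * est_if_chosen ** 2)
    return mean, weighted_second

# Hypothetical distribution and payoffs.
mean0, second0 = bandit_moments([0.25, 0.75], [2.0, -1.0], 0)
```

The first moment equals $x_i$ (unbiasedness) and the weighted second moment equals $x_i^2$, whatever the value of $p_i > 0$.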

In label-efficient prediction problems, $\tilde x_{i,t} = (x_{i,t}/\varepsilon) Z_t$, where the $Z_t$
are i.i.d. random variables distributed according to a Bernoulli distribution
with parameter $\varepsilon \sim m/n$. Then,

$$\mathbb{E}_t\left[ p_{i,t}\, \tilde x_{i,t}^2 \right] = p_{i,t} \frac{x_{i,t}^2}{\varepsilon} \le p_{i,t} \frac{M^2}{\varepsilon}.$$

Summing over $i = 1, \ldots, N$ and $t = 1, \ldots, n$ we recover the bound
$M \sqrt{(n/\varepsilon) \ln N} \sim M n \sqrt{(\ln N)/m}$ of Cesa-Bianchi, Lugosi, and Stoltz
(2005).

Finally, in games with partial monitoring, the quantity (18) is less
than $M^2 t^{-1/3} N^{2/3} (\ln N)^{-1/3}$. Summing over $i = 1, \ldots, N$ and $t =
1, \ldots, n$ we recover the $M n^{2/3} N^{2/3} (\ln N)^{1/3}$ bound of Cesa-Bianchi,
Lugosi, and Stoltz (2004).

In conclusion, the faster $\sqrt{n}$ rate in bandit problems, as opposed to
the $n^{2/3}$ rate in problems of prediction under partial monitoring, is due
to better statistical performances (i.e., smaller conditional variance) of
the available estimators.



5. Using translations of the payoffs 

We now consider the bounds derived from those of Sections 3 and 4 in 
the case when translations are performed on the payoffs (Section 5.1). 
We show that they lead to several improvements or extensions of earlier 
results (Section 5.2) and also relieve the forecaster from the need of any 
preliminary manipulation on the payoffs (Section 5.3). 

5.1. On-line translations of the payoffs 

Note that any on-line forecasting strategy may be used by a meta-forecaster
which, before applying the given strategy, may first translate
the payoffs according to a prescribed rule that may depend on the past.
More formally, the meta-forecaster runs the strategy with the payoffs
$r_{k,t} = x_{k,t} - \mu_t$, where $\mu_t$ is any quantity possibly based on the past
payoffs $x_{i,s}$, for $i = 1, \ldots, N$ and $s = 1, \ldots, t$.
The forecasting strategies of Section 4 (and the obtained bounds)
are invariant by such translations. This is however not the case for
the prod-type algorithms of Section 3. An interesting application is
obtained in Section 5.2 by considering $\mu_t = \hat x_t$ where we recall that
$\hat x_t = x_{1,t} p_{1,t} + \cdots + x_{N,t} p_{N,t}$ is the forecaster's reward at time $t$. As the
sums $\mu_1 + \cdots + \mu_n$ cancel out in the difference $\hat X_n - X_{k,n}$, we obtain the
following corollary of Theorem 2. Note that the remainder term here is
now expressed in terms of the effective ranges (1) of the payoffs.
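The contrast between the two families of forecasters can be seen on a toy example. The Python sketch below compares exponential weights with a multiplicative update of the form $w_i \leftarrow w_i (1 + \eta x_i)$; the latter is assumed here to be the update underlying the prod-type algorithms of Section 3, and all function names and numerical values are hypothetical.

```python
import math

def exp_weights(history, eta):
    """Exponential weights computed from a list of payoff vectors."""
    n_actions = len(history[0])
    cum = [sum(x[i] for x in history) for i in range(n_actions)]
    top = max(cum)
    w = [math.exp(eta * (c - top)) for c in cum]
    total = sum(w)
    return [wi / total for wi in w]

def prod_weights(history, eta):
    """Multiplicative weights w_i = prod_s (1 + eta * x_{i,s}), normalised
    (assumed form of the prod update; requires eta * x > -1)."""
    w = [1.0] * len(history[0])
    for x in history:
        w = [wi * (1.0 + eta * xi) for wi, xi in zip(w, x)]
    total = sum(w)
    return [wi / total for wi in w]

def translate(history, mus):
    """Subtract the round-dependent quantity mu_t from every payoff of round t."""
    return [[xi - mu for xi in x] for x, mu in zip(history, mus)]

history = [[1.0, 0.0], [0.5, 0.2]]
shifted = translate(history, [1.0, 0.3])
```

The exponential weights are the same for `history` and `shifted`, because a common shift of all actions cancels in the normalisation, whereas the multiplicative weights change. This is why each translation rule defines a different prod-type forecaster.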

Corollary 2. Given $E > 0$, for all $n \ge 1$ and all sequences of payoffs
with effective ranges $E_t$ bounded by $E$, the cumulative reward of
algorithm prod-Q($E$) run using translated payoffs $x_{k,t} - \hat x_t$ satisfies

$$\hat X_n \ge X_n^* - 8 \sqrt{(\ln N) \max_{s \le n} R_s^*} - 2 E \left( 1 + \log_4 n + 2 \left( 1 + \lfloor (\log_2 \ln N)/2 \rfloor \right) \ln N \right)$$

where the $R_t^*$ are defined as follows. For $1 \le t \le n$ and $k = 1, \ldots, N$,
$R_{k,t} = (x_{k,1} - \hat x_1)^2 + \cdots + (x_{k,t} - \hat x_t)^2$ and $R_t^* = R_{k_t^*,t}$, where $k_t^*$ is the
index of the action achieving the best cumulative payoff at round $t$ (ties
are broken by choosing the action $k$ with smallest associated $R_{k,t}$).

Remark 1. In one-sided games, for instance in gain games, the forecaster
always has an incentive to translate the payoffs by the minimal
payoff $\mu_t$ obtained at each round $t$,

$$\mu_t = \min_{j=1,\ldots,N} x_{j,t}.$$

This is because for all $j$ and $t$, $(x_{j,t} - \mu_t)^2 \le x_{j,t}^2$ in a gain game. The issue
is not so clear however for signed games, and it may be a delicate issue
to determine beforehand if the payoffs should be translated, and if so,
which translation rule should be used. See also Section 4.4, as well as
Section 5.2.

5.2. Improvements for small or large payoffs 

As recalled in Section 2.2, when all payoffs have the same sign Freund
and Schapire (1997) first showed that Littlestone and Warmuth's
weighted majority algorithm (1994) can be used to construct a forecasting
strategy achieving a regret of order $\sqrt{M |X_n^*| \ln N} + M \ln N$,
where $N$ is the number of actions, $M$ is a known upper bound on
the magnitude of payoffs ($|x_{i,t}| \le M$ for all $t$ and $i$), and $|X_n^*|$ is the
absolute value of the cumulative payoff of the best action (i.e., the
largest cumulative payoff in a gain game or the smallest cumulative
loss in a loss game), see also Auer, Cesa-Bianchi, and Gentile (2002).

This bound is good when $|X_n^*|$ is small in the one-sided game; that
is, when the best action has a small gain (in a gain game) or a small
loss (in a loss game). However, one often expects the best expert to
be effective (for instance, because we have many experts and at least
one of them is accurate). An effective expert in a loss game suffers a
small cumulative loss, but in a gain game, such an expert should get a
large cumulative payoff $X_n^*$. To obtain a bound that is good when $|X_n^*|$
is large one could apply the translation $x'_{i,t} = x_{i,t} - M$ (from gains to
losses) or the translation $x'_{i,t} = x_{i,t} + M$ (from losses to gains). In both
cases one would obtain a bound of the form $\sqrt{M (Mn - |X_n^*|) \ln N}$,
which is now suited for effective experts in gain games and poor experts
in loss games, but not for effective experts in loss games and poor
experts in gain games. Since the original bound is not stable under
the operation of conversion from one type of one-sided game into the
other, the forecaster has to guess whether to play the original game
or its translated version, depending on his beliefs on the quality of the
experts and on the nature of the game (losses or gains).

In Corollary 4 we use the sharper bound of Corollary 2 to prove a
(first-order) bound of the form

$$\sqrt{M \min\left\{ |X_n^*|,\ Mn - |X_n^*| \right\} \ln N}.$$

This is indeed an improvement for small losses or large gains, though it
requires knowledge of $M$. However, in Remark 2 we will indicate how
to extend this result to the case when $M$ is not known beforehand.
Note that the (second-order) bound of Corollary 3 also yields the same
result without any preliminary knowledge of $M$.

We thus recover an earlier result by Allenberg-Neeman and Neeman
(2004). They proved, in a gain game, for a related algorithm, and
with the previous knowledge of a bound $M$ on the payoffs, a bound
whose main term is $11.4 \sqrt{M} \min\left\{ \sqrt{X_n^*},\ \sqrt{Mn - X_n^*} \right\}$. That algorithm
was specifically designed to ensure a regret bound of this form, and
is different from the algorithm whose performance we discussed before
the statement of Corollary 1, whereas we obtain the improvements for
small losses or large gains as corollaries of much more general bounds
that have other consequences.

5.2.1. Analysis for exponentially weighted forecasters 
The main drawback of $V_n$, used in Theorem 6, is that it is defined
directly in terms of the forecaster's distributions $p_t$. We now show how
this dependence could be removed.
Corollary 3. Consider the weighted majority forecaster run with the
time-varying learning rate (16). Then, for all sequences of payoffs in a
one-sided game (i.e., payoffs are all nonpositive or all nonnegative),

$$\hat X_n \ge X_n^* - 4 \sqrt{|X_n^*| \left( M - \frac{|X_n^*|}{n} \right) \ln N} - 39 M \max\{1, \ln N\}$$

where $M = \max_{t=1,\ldots,n} \max_{i=1,\ldots,N} |x_{i,t}|$.

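Proof. We give the proof for a gain game. Since the payoffs are in
$[0, M]$, we can write

$$V_n = \sum_{t=1}^n \left( \sum_{i=1}^N p_{i,t} x_{i,t}^2 - \left( \sum_{i=1}^N p_{i,t} x_{i,t} \right)^2 \right) \le \sum_{t=1}^n \left( M \hat x_t - \hat x_t^2 \right)$$
$$\le n \left( M \frac{\hat X_n}{n} - \left( \frac{\hat X_n}{n} \right)^2 \right) = \hat X_n \left( M - \frac{\hat X_n}{n} \right)$$

where we used the concavity of $x \mapsto Mx - x^2$. Assume that $\hat X_n \le X_n^*$
(otherwise the result is trivial). Then, Theorem 6 ensures that

$$\hat X_n - X_n^* \ge -4 \sqrt{X_n^* \left( M - \frac{\hat X_n}{n} \right) \ln N} - \kappa$$

where $\kappa = 4 M \ln N + 6 M$. We solve for $\hat X_n$ obtaining

$$\hat X_n - X_n^* \ge -4 \sqrt{X_n^* \left( M - \frac{X_n^*}{n} + \frac{\kappa}{n} \right) \ln N} - \kappa - 16 \frac{X_n^*}{n} \ln N.$$

Using the crude upper bound $X_n^*/n \le M$ and performing some simple
algebra, we get the desired result. □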
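The first step of this proof, the bound of the cumulative variance by $\hat X_n (M - \hat X_n/n)$ in a gain game, can be checked numerically. The sketch below uses arbitrary distributions and payoffs in $[0, M]$; the data, the seed, and the function name are assumptions for the example, and the inequality does not depend on how the distributions are produced.

```python
import random

def variance_and_cap(ps, xs, M):
    """Return (V_n, X_hat * (M - X_hat / n)) for distributions ps[t] and
    payoff vectors xs[t] with entries in [0, M]."""
    n = len(xs)
    V = 0.0
    reward = 0.0
    for p, x in zip(ps, xs):
        mean = sum(pi * xi for pi, xi in zip(p, x))
        V += sum(pi * xi * xi for pi, xi in zip(p, x)) - mean * mean
        reward += mean
    return V, reward * (M - reward / n)

rng = random.Random(2)
M = 1.0
xs = [[rng.uniform(0.0, M) for _ in range(4)] for _ in range(50)]
ps = []
for _ in range(50):
    raw = [rng.random() + 1e-9 for _ in range(4)]
    ps.append([r / sum(raw) for r in raw])
V, cap = variance_and_cap(ps, xs, M)
```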

Similarly to the remark about constant factors in Section 4.4, the factor
4 in Corollary 3 can be made as close as desired to $4 \sqrt{e - 2} =
2 \sqrt{2} \sqrt{2 (e - 2)}$, which is not much larger than the best known leading
constant for improvements for small losses, $2 \sqrt{2}$, see Auer, Cesa-Bianchi,
and Gentile (2002). But here, we have in addition an improvement
for large losses, and deal with unknown ranges $M$. (Note, similarly
to the discussion in Section 4.4, the presence of the same small factor
$\sqrt{2 (e - 2)} \approx 1.2$.)

5.2.2. Analysis for prod-type forecasters 

Quite surprisingly, a bound of the same form as the one shown in 
Corollary 3 can be derived from Corollary 2. 
Corollary 4. Given $M > 0$, for all $n \ge 1$ and all sequences of payoffs
bounded by $M$, i.e., $\max_{1 \le i \le N} \max_{1 \le t \le n} |x_{i,t}| \le M$, the cumulative
reward of algorithm prod-Q($2M$), run using translated payoffs $x_{k,t} - \hat x_t$
in a one-sided game, satisfies

$$\hat X_n \ge X_n^* - 8 \sqrt{2 M \min\left\{ X_n^*,\ Mn - X_n^* \right\} \ln N} - 128 M \ln N - \kappa - 8 \sqrt{2 M (\ln N) \kappa}$$

where

$$\kappa = 4 M \left( 1 + \log_4 n + 2 \left( 1 + \lfloor (\log_2 \ln N)/2 \rfloor \right) \ln N \right) = \Theta\left( M (\ln n) + M (\ln N)(\ln \ln N) \right).$$

Proof. As in the proof of Corollary 3, it suffices to give the proof for
a gain game. In fact, we apply below the bound of Corollary 2, which is
invariant under the change $\ell_{i,t} = M - x_{i,t}$ that converts bounded losses
into bounded nonnegative payoffs.

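The main term in the bound of Corollary 2, with the notations
therein, involves

$$\max_{s \le n} R_s^* \le \min\left\{ M \left( X_n^* + \hat X_n \right),\ M \left( 2Mn - X_n^* - \hat X_n \right) \right\}. \qquad (19)$$

Indeed, using that $(a - b)^2 \le a^2 + b^2$ for $a, b \ge 0$, we get on the one
hand, for all $1 \le s \le n$,

$$R_s^* \le \sum_{t=1}^s \left( x_{k_s^*,t}^2 + \hat x_t^2 \right) \le M \left( X_{k_s^*,s} + \hat X_s \right) \le M \left( X_n^* + \hat X_n \right)$$

whereas on the other hand, the same techniques yield

$$R_s^* = \sum_{t=1}^s \left( (M - x_{k_s^*,t}) - (M - \hat x_t) \right)^2 \le M \left( (Ms - X_s^*) + (Ms - \hat X_s) \right).$$

Now, we note that for all $s$, $X_{s+1}^* \le X_s^* + M$, and similarly, $\hat X_{s+1} \le
\hat X_s + M$. Thus we also have $\max_{s \le n} R_s^* \le M (2Mn - X_n^* - \hat X_n)$.
Corollary 2, combined with (19), yields

$$\hat X_n \ge X_n^* - 8 \sqrt{M (\ln N) \min\left\{ \left( X_n^* + \hat X_n \right),\ \left( 2Mn - X_n^* - \hat X_n \right) \right\}} - \kappa$$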
where $\kappa = 4 M \left( 1 + \log_4 n + 2 (1 + \lfloor (\log_2 \ln N)/2 \rfloor) \ln N \right)$. Without loss
of generality, we may assume that $\hat X_n \le X_n^*$ and get

$$\hat X_n \ge X_n^* - 8 \sqrt{2 M (\ln N) \min\left\{ X_n^*,\ (Mn - \hat X_n) \right\}} - \kappa.$$

Solving for $\hat X_n$ and performing simple algebra in case the minimum is
achieved by the term containing $\hat X_n$ concludes the proof. □

Remark 2. The forecasting strategy of Theorem 4, when used by a
meta-forecaster translating the payoffs by $\hat x_t$, achieves an improvement
for small or large payoffs of the form

$$M \sqrt{\min\left\{ \max_{s \le n} \frac{X_s^*}{M_s},\ \max_{s \le n} \frac{s M_s - X_s^*}{M_s} \right\}}$$

without previous knowledge of $M$.



5.2.3. The case of signed games 

The proofs of Corollaries 3 and 4 reveal that the assumption of one-sidedness
cannot be relaxed. However, we may also prove a version of
the improvement for small losses or for large gains suited to signed
games. Remember that, as explained in Section 2.3, a meta-forecaster
may always convert a signed game into a one-sided game by performing
a suitable translation on the payoffs, and then apply a strategy for
one-sided games. Since Corollary 2 and Theorem 6 are stable under
general translations, applying them to the payoffs $x_{i,t}$ or to a translated
version of them $x'_{i,t}$ results in the same bounds. If the translated version
$x'_{i,t}$ corresponds to a one-sided game, then the bounds of Corollaries 3
and 4 may be applied. Using $x'_{i,t} = x_{i,t} - \min_{j=1,\ldots,N} x_{j,t} \ge 0$ and
$x'_{i,t} = x_{i,t} - \max_{j=1,\ldots,N} x_{j,t} \le 0$ for the analysis, we may show, for
instance, that for any signed game the forecaster of Theorem 6 ensures
that the regret is bounded by a quantity whose main term is less than


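$$4 \min\left\{ \sqrt{E (\ln N) \max_{j=1,\ldots,N} \sum_{t=1}^n \left( x_{j,t} - \min_{i=1,\ldots,N} x_{i,t} \right)},\ \sqrt{E (\ln N) \min_{j=1,\ldots,N} \sum_{t=1}^n \left( \max_{i=1,\ldots,N} x_{i,t} - x_{j,t} \right)} \right\}.$$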



This bound is obtained without any previous knowledge of a bound M 
on the payoffs, and is sharper than both bounds (2) and (3). It may be 
interpreted as an improvement for small or large cumulative payoffs. 
5.3. What is a "fundamental" bound?

Most of the known regret bounds are not stable under natural transformations
of the payoffs, such as translations and rescalings.$^1$ If a
regret bound is not stable, then a (meta-)prediction algorithm might
be willing to manipulate the payoffs in order to achieve a better regret.
However, in general it is hard to choose the payoff transformation that
is best for a given and unknown sequence of payoffs. For this reason, we
argue that regret bounds that are stable under payoff transformations
are, in some sense, more fundamental than others. The bounds that
we have derived in this paper are based on sums of squared payoffs.
They are not only generally tighter than the previously known bounds,
but also stable under different transformations, such as those described
below (in what follows, we use $x'_{i,t}$ to indicate a transformed payoff).

Additive translations: $x'_{i,t} = x_{i,t} - \mu_t$.

Note that the regret (of playing a fixed sequence $p_1, p_2, \ldots$) is not
affected by this transformation. Hence, stable bounds should not change
when payoffs are translated. As already explained in Section 5.2, translations
can be used to turn a gain game into a loss game and vice
versa.

The invariance by general translations is the hardest to obtain, and
this paper is the first one to show tight translation-invariant bounds
that depend on the specific sequence of payoffs rather than just on its
length (see Corollary 2, Theorem 6 and some of their corollaries, e.g.,
Corollary 1). It is also important to remark that, in a stable bound,
not only the leading term, but also the smaller order terms, have to
be stable under translations. This is why the smaller order terms of
Corollary 2 and Theorem 6 involve bounds on the payoff ranges $x_{i,t} - x_{j,t}$
rather than just on the payoffs $x_{i,t}$.

Rescalings: $x'_{i,t} = \alpha x_{i,t}$, $\alpha > 0$.

As this transformation causes the regret to be multiplied by a factor of
$\alpha$, stable bounds should only change by the same factor $\alpha$. Obtaining
bounds that are stable under rescalings is not always easy when the
payoff ranges are not known beforehand, or when we try to get bounds
sharper than the basic zero-order bounds discussed in Section 2.1. For
instance, the application of a doubling trick on the magnitude of the
payoffs, or even the use of more sophisticated incremental techniques,
may lead to small but undesirable $M \ln(Mn)$ terms, which behave badly
upon rescalings. This was the case with the remainder term $M \ln(1 +
|X_n^*|)$ in Theorem 2.1 by Auer, Cesa-Bianchi, and Gentile (2002) where
they assume knowledge of the payoff range but seek sharper bounds.

$^1$ Here we do not distinguish between stable bounds and stable algorithms because
all the stability properties we consider for the bounds are due to a corresponding
stability of the prediction scheme they are derived from. When a stable algorithm
does not achieve a stable bound, it suffices to optimize the bound in hindsight,
thanks to the stability properties of the prediction scheme.

Note also that forecasters with scaling-invariant bounds should re- 
quire no previous knowledge on the payoff sequence (such as the payoff 
range) as this information is scale-sensitive. This is why, for instance, 
the bounds of Theorems 2 and 5 cannot be considered scaling-invariant. 
However, modifications of these forecasters that increase their adaptive- 
ness lead to Theorems 4 and 6. There we could derive scaling-invariant 
bounds by using forecasters based on updates which are defined in 
terms of quantities that already have this type of invariance. 

Whereas translation-invariant bounds that are also sharp are gen- 
erally hard to obtain, we feel that any bound can be made stable with 
respect to rescalings via a reasonably accurate analysis. 
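The effect of the two transformations on the regret of a fixed sequence of distributions can be illustrated directly. In the Python sketch below, `reward_gap` denotes $\hat X_n - X_n^*$; the function name and the numerical data are hypothetical.

```python
def reward_gap(ps, xs):
    """Cumulative expected reward of the fixed distributions ps[t] on the
    payoff vectors xs[t], minus the cumulative payoff of the best action."""
    n_actions = len(xs[0])
    forecaster = sum(sum(pi * xi for pi, xi in zip(p, x)) for p, x in zip(ps, xs))
    best = max(sum(x[i] for x in xs) for i in range(n_actions))
    return forecaster - best

# Hypothetical three-round game with two actions.
ps = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
xs = [[1.0, -1.0], [0.0, 2.0], [-3.0, 0.5]]

mus = [0.7, -1.2, 4.0]   # arbitrary round-dependent translations
alpha = 3.0              # arbitrary positive rescaling
translated = [[xi - mu for xi in x] for x, mu in zip(xs, mus)]
rescaled = [[alpha * xi for xi in x] for x in xs]
```

The gap is unchanged on `translated` and multiplied by `alpha` on `rescaled`, which is the behaviour a stable bound is expected to reproduce.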

Unstable bounds can lead the meta-forecaster to Cornelian dilemmas.
Consider for instance the bound (4) by Allenberg-Neeman and
Neeman (2004). If we use a meta-forecaster that translates payoffs by a
quantity $\mu_t$ (possibly depending on past observations), then the bound
takes the form



Note that the choice $\mu_t = -M$ (or $\mu_t = \min_{j=1,\ldots,N} x_{j,t}$) yields the
improvement for small payoffs (2) and the choice $\mu_t = M$ (or $\mu_t =
\max_{j=1,\ldots,N} x_{j,t}$) yields the improvement for large payoffs (3). In general,
the above bound is tight if, for a large number of rounds, all
payoffs $x_{j,t}$ at a given round $t$ are close to a common value, and we
may guess this value to choose $\mu_t$ accordingly. In Section 5.2.3, on the
other hand, we show that Corollaries 3 and 4 propose bounds that need
no preliminary choices of $\mu_t$ and are better than both (2) and (3).



6. Discussion and open problems

We have analyzed forecasting algorithms that work indifferently in loss
games, gain games, and signed games. In Corollary 2 and Theorem 6 we
have shown, for these forecasters, sharp regret bounds that are stable
under rescalings and general translations. These bounds lead to improvements
for small or large payoffs in one-sided games (Corollaries 3
and 4) and do not assume any preliminary information about the payoff
sequence.$^2$

A practical advantage of the weighted majority forecaster is that
its update rule is completely incremental and never needs to reset the
weights. This is in contrast to the forecaster prod-MQ of Theorem 4 that
uses a nested doubling trick. On the other hand, the bound proposed in
Theorem 6 is not in closed form, as it still explicitly depends through
$V_n$ on the forecaster's rewards $\hat x_t$. We therefore need to solve for the
regrets as we did, for instance, in Sections 4.4 and 5.2. Finally, it was
also noted in Section 4.4 that the weighted majority forecaster update
is invariant under translations of the payoffs. This is not the case for
the prod-type forecasters, which need to perform translations explicitly.
Though in general it may be difficult to determine beforehand what a
good translation could be, Corollaries 2 and 4, as well as Remark 1,
indicate some general effective translation rules.

Several issues are left open: 

Design and analyze incremental updates for the prod-type fore- 
casters of Section 3. 

Obtain second-order bounds with updates that are not multiplica- 
tive; for instance, updates based on the polynomial potentials (see 
Cesa-Bianchi and Lugosi, 2003). These updates could be used as 
basic ingredients to derive forecasters achieving optimal orders 
of magnitude on the regret when applied to problems such as 
nonstochastic multiarmed bandits, label-efficient prediction, and 
partial monitoring. Note that, to the best of our knowledge, in the 
literature about incomplete information problems only exponen- 
tially weighted averages have been able to achieve these optimal 
rates (see Section 4.5 and the references therein). 

Extend the analysis of prod-type algorithms to obtain an oracle
inequality of the form

$$\hat X_n \ge \max_{k} \left( X_{k,n} - \gamma_1 \sqrt{Q_{k,n} \ln N} \right) - \gamma_2 M \ln N$$

where $\gamma_1$ and $\gamma_2$ are absolute constants. Inequalities of this form
can be viewed as game-theoretic versions of the model selection
bounds in statistical learning theory.

2 Whereas the bound of Theorem 6 is already stated this way, we recall that it
is easy to modify the forecaster used to prove Corollary 2 in order to dispense with
the need of any preliminary knowledge of a bound E on the payoff ranges.

Appendix

Proof of Theorem 4 

We use some additional notation for the proof: $(r,s)-1$ denotes the
epoch right before $(r,s)$; that is, $(r,s-1)$ when $s > 0$, and
$(r-1,\, S_{r-1}-S_{r-2})$ when $s = 0$. For notational convenience,
$t_{(0,0)-1}$ is conventionally set to 0.

Proof. The proof combines the techniques from Theorems 2 and 3.
As in the proof of Theorem 3, we denote by $(R,\, S_R - S_{R-1})$ the index
of the last epoch and let $t_{(R,\,S_R-S_{R-1})} = n$.

We assume $R \ge 1$ and $S_R \ge 1$. Otherwise, if $R = 0$, this means that
$M_t = M^{(0)}$ for all $t \le n-1$, and the strategy, and thus the proposed
bound, reduces to the one of Theorem 2. The case $S_R = 0$ is dealt with
at the end of the proof. In particular, $S_R \ge 1$ implies that some epoch
ended at time $t$ when $Q^*_t > 4^{S_R-1} M_t^2$. This implies that
$q \ge 4^{S_R-1}$ ($\ge 1$), which in turn implies $2^{S_R} \le 2\sqrt{q}$
and $S_R \le 1 + (\log_2 q)/2$.
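The two consequences of the lower bound on $q$ follow by taking a square root and then a base-2 logarithm. Spelled out as a short worked step (no assumption beyond $q \ge 4^{S_R-1}$):

```latex
q \ge 4^{S_R-1} = \bigl(2^{S_R-1}\bigr)^2
\;\Longrightarrow\;
\sqrt{q} \ge 2^{S_R-1}
\;\Longrightarrow\;
2^{S_R} \le 2\sqrt{q},
\qquad
S_R - 1 \le \log_2 \sqrt{q} = \tfrac{1}{2}\log_2 q .
```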

Denote $M^{(R+1)} = M_n$. Note that at time $n$ we have either
$M_n \le M^{(R)}$, implying $M_n = M^{(R+1)} = M^{(R)}$, or we have
$M_n > M^{(R)}$, implying $M_n = M^{(R+1)} \ge 2M^{(R)}$. In both cases,
$M^{(R)} \le M^{(R+1)} \le 2M$. Furthermore, $M^{(s)} \ge 2^{s-r} M^{(r)}$
for each $0 \le r \le s \le R$, and thus (11) holds for $s \le R$ with
$M_{t_r}$ replaced by $M^{(r)}$.

Similar to the proof of Theorem 2, for each epoch $(r,s)$, let

$$X_k^{(r,s)} = \sum_{t=t_{(r,s)-1}+1}^{t_{(r,s)}-1} x_{k,t}, \qquad
Q_k^{(r,s)} = \sum_{t=t_{(r,s)-1}+1}^{t_{(r,s)}-1} x_{k,t}^2, \qquad
\widehat{X}^{(r,s)} = \sum_{t=t_{(r,s)-1}+1}^{t_{(r,s)}-1} \widehat{x}_t$$

where the sums are over all the time steps $t$ in epoch $(r,s)$ except the
last one, $t_{(r,s)}$. We also denote by $k_{(r,s)} = k^*_{t_{(r,s)}-1}$ the
index of the best overall expert up to time $t_{(r,s)}-1$ (one time step
before the end of epoch $(r,s)$).

We upper bound the cumulative payoff of the best action as

$$X^*_n \le \sum_{r=0}^{R} \Biggl( M^{(r+1)} + (S_r - S_{r-1})\, M^{(r)} + \sum_{s=0}^{S_r - S_{r-1}} X^{(r,s)}_{k_{(r,s)}} \Biggr) \qquad (20)$$

by using the same argument by induction as in (10). More precisely,
we write, for each $(r,s)$,

$$\begin{aligned}
X_{k_{(r,s)},\,t_{(r,s)}-1}
&= X^{(r,s)}_{k_{(r,s)}} + x_{k_{(r,s)},\,t_{(r,s)-1}} + X_{k_{(r,s)},\,t_{(r,s)-1}-1} \\
&\le X^{(r,s)}_{k_{(r,s)}} + M_{t_{(r,s)-1}} + X_{k_{(r,s)-1},\,t_{(r,s)-1}-1} .
\end{aligned}$$

We note that $M_{t_{(r,s)}} = M^{(r)}$ whenever $0 \le s < S_r - S_{r-1}$ and
$M_{t_{(r,s)}} = M^{(r+1)}$ otherwise. This and

$$X^*_n \le X^*_{n-1} + M^{(R+1)} = X_{k_{(R,\,S_R-S_{R-1})},\,t_{(R,\,S_R-S_{R-1})}-1} + M^{(R+1)}$$

show (20) by induction.
Let

$$\kappa = \sum_{r=0}^{R} \Bigl( M^{(r+1)} + (S_r - S_{r-1})\, M^{(r)} \Bigr) .$$

To show a bound on $\kappa$, note that (11) implies

$$\sum_{r=0}^{R} M^{(r+1)} \le 2M^{(R)} + M^{(R+1)} \le 3M^{(R+1)} \le 6M \qquad (21)$$

and

$$\sum_{r=0}^{R} (S_r - S_{r-1})\, M^{(r)} \le 2M\, S_R \le M\,(2 + \log_2 q) .$$

Thus, $\kappa \le (8 + \log_2 q)\, M$.

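Now, similarly to the above bound on $X^*_n$,

$$\widehat{X}_n \ge -\kappa + \sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \widehat{X}^{(r,s)}$$

so that the regret $\widehat{X}_n - X^*_n$ is larger than

$$\widehat{X}_n - X^*_n \ge -2\kappa + \sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \Bigl( \widehat{X}^{(r,s)} - X^{(r,s)}_{k_{(r,s)}} \Bigr) .$$

Now note that each time step $t$ (but the last one) of epoch $(r,s)$ satisfies
$M_t \le M^{(r)}$ and $\eta_{(r,s)} \le 1/(2M^{(r)})$. Therefore, we can apply
Lemma 2 to $\widehat{X}^{(r,s)} - X^{(r,s)}_{k_{(r,s)}}$ for each epoch $(r,s)$.
This gives

$$\widehat{X}_n - X^*_n \ge -2\kappa - \sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \Biggl( \frac{\ln N}{\eta_{(r,s)}} + \eta_{(r,s)}\, Q^{(r,s)}_{k_{(r,s)}} \Biggr) .$$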

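By definition of the algorithm, for all epochs $(r,s)$,

$$Q^{(r,s)}_{k_{(r,s)}} \le Q_{k_{(r,s)},\,t_{(r,s)}-1} = Q^*_{t_{(r,s)}-1} \le 4^{S_{r-1}+s} \bigl(M^{(r)}\bigr)^2$$

and

$$\eta_{(r,s)} \le \sqrt{\ln N} \Big/ \bigl(2^{S_{r-1}+s} M^{(r)}\bigr) .$$

Therefore,

$$\begin{aligned}
\sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \eta_{(r,s)}\, Q^{(r,s)}_{k_{(r,s)}}
&\le \sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} 2^{S_{r-1}+s} M^{(r)} \sqrt{\ln N} \\
&\le \sum_{r=0}^{R} \sum_{s=1}^{S_r - S_{r-1}} 2^{S_{r-1}+s} (2M) \sqrt{\ln N}
   + \sum_{r=0}^{R} 2^{S_{r-1}} M^{(r)} \sqrt{\ln N} \\
&\le \sum_{s=1}^{S_R} 2^{s} (2M) \sqrt{\ln N}
   + 2^{S_R} \sum_{r=0}^{R} M^{(r)} \sqrt{\ln N} \\
&\le (2M)\, 2^{S_R+1} \sqrt{\ln N} + 2^{S_R} (4M) \sqrt{\ln N} \qquad (22) \\
&\qquad \text{(using (11) and } M^{(R)} \le 2M\text{)} \\
&\le (16M) \sqrt{q \ln N}
\end{aligned}$$

since $q \ge 4^{S_R-1}$ implies $2^{S_R} \le 2\sqrt{q}$.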

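We now turn our attention to the remaining sum

$$\sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \frac{\ln N}{\eta_{(r,s)}} .$$

By definition of the algorithm,

$$\eta_{(r,s)} = \begin{cases}
1/(2M^{(r)}) & \text{if } S_{r-1} + s \le \lceil (\log_2 \ln N)/2 \rceil , \\
\sqrt{\ln N} \big/ \bigl(2^{S_{r-1}+s} M^{(r)}\bigr) & \text{otherwise.}
\end{cases}$$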
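The two-case learning rate can be written as a small function. This is only a sketch of the rule displayed above: the function name and argument names are hypothetical, `level` stands for $S_{r-1}+s$, and `m_r` stands for $M^{(r)}$.

```python
import math

def epoch_learning_rate(level, m_r, n_actions):
    """Two-case learning rate for an epoch.

    level     : S_{r-1} + s, the number of Q-doublings so far (hypothetical name)
    m_r       : the current payoff scale M^(r) (hypothetical name)
    n_actions : number of actions N, assumed to be at least 2
    """
    log_n = math.log(n_actions)
    threshold = math.ceil(math.log2(log_n) / 2.0)
    if level <= threshold:
        # Early epochs: use the largest rate allowed, 1 / (2 M^(r)).
        return 1.0 / (2.0 * m_r)
    # Later epochs: the rate halves with every additional doubling.
    return math.sqrt(log_n) / (2.0 ** level * m_r)

for level in range(5):
    print(level, epoch_learning_rate(level, m_r=1.0, n_actions=10))
```

In both branches the returned value is at most $1/(2M^{(r)})$ and at most $\sqrt{\ln N}/(2^{S_{r-1}+s} M^{(r)})$, which are the two properties of $\eta_{(r,s)}$ that the proof relies on.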



We denote by $(r^*, s^*)$ the last couple $(r,s)$ for which
$\eta_{(r,s)} = 1/(2M^{(r)})$. With obvious notation, a crude
overapproximation leads to

$$\sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} \frac{\ln N}{\eta_{(r,s)}}
\le \sum_{(r,s) \le (r^*,s^*)} 2M^{(r)} \ln N
 + \sum_{r=0}^{R} \sum_{s=0}^{S_r - S_{r-1}} 2^{S_{r-1}+s} M^{(r)} \sqrt{\ln N} .$$

We already have the upper bound $(16M)\sqrt{q \ln N}$ for the second sum.
For the first one, we write

$$\begin{aligned}
\sum_{(r,s) \le (r^*,s^*)} 2M^{(r)} \ln N
&= \sum_{r=0}^{r^*} 2M^{(r)} \ln N
 + \sum_{r=0}^{r^*-1} (S_r - S_{r-1}) \bigl(2M^{(r)}\bigr) \ln N
 + s^* \bigl(2M^{(r^*)}\bigr) \ln N \\
&\le \sum_{r=0}^{r^*} 2M^{(r)} \ln N + (S_{r^*-1} + s^*)(4M) \ln N \\
&\le 2M (\ln N) \bigl( 3 + 2 \lceil (\log_2 \ln N)/2 \rceil \bigr)
\end{aligned}$$

where we used (21). The proof is concluded in the case $S_R \ge 1$ by
putting things together and performing some overapproximation.

When $S_R = 0$, $q = 1$, $\kappa$ is simply less than $6M$, (22) is less than
$8M\sqrt{\ln N}$, so that the bound holds as well in this case. □



Proof of Lemma 3 

We first note that Jensen's inequality implies that $\Phi$ is nonnegative.
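The Jensen step can be written out in one line. This sketch assumes that $\Phi(p, \eta, x) = \frac{1}{\eta} \ln \sum_i p_i e^{\eta x_i} - \sum_i p_i x_i$ for a probability vector $p$ and $\eta > 0$; the statement of Lemma 3 is not reproduced in this appendix, and this form is the one consistent with how $\Phi$ is used later in the proof.

```latex
\ln \sum_{i=1}^{N} p_i\, e^{\eta x_i}
\;\ge\; \sum_{i=1}^{N} p_i \ln e^{\eta x_i}
\;=\; \eta \sum_{i=1}^{N} p_i x_i
\quad\Longrightarrow\quad
\Phi(p, \eta, x)
= \frac{1}{\eta} \ln \sum_{i=1}^{N} p_i\, e^{\eta x_i} - \sum_{i=1}^{N} p_i x_i
\;\ge\; 0 .
```

The first inequality is Jensen's inequality applied to the concave function $\ln$.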

The proof below is a simple modification of an argument first proposed
in Auer, Cesa-Bianchi, and Gentile (2002). Note that we consider
real-valued (not necessarily nonnegative) payoffs in what follows. For
$t = 1, \dots, n$, we rewrite $p_{i,t} = w_{i,t}/W_t$, where
$w_{i,t} = e^{\eta_t X_{i,t-1}}$ and $W_t = \sum_{j=1}^{N} w_{j,t}$ (the
payoffs $X_{i,0}$ are understood to equal 0, and thus, $\eta_1$ may be any
positive number satisfying $\eta_1 \ge \eta_2$). Use
$w'_{i,t} = e^{\eta_{t-1} X_{i,t-1}}$ to denote the weight $w_{i,t}$ where
the parameter $\eta_t$ is replaced by $\eta_{t-1}$. The associated
normalization factor will be denoted by $W'_t = \sum_{j=1}^{N} w'_{j,t}$.
Finally, we use $j^*_t$ to denote the expert with the largest cumulative
payoff after the first $t$ rounds (ties are broken by choosing the expert
with smallest index). That is, $X_{j^*_t,t} = \max_{i \le N} X_{i,t}$. We
also make use of the following technical lemma.

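Lemma 5. (Auer, Cesa-Bianchi, and Gentile, 2002) For all $N \ge 2$,
for all $\beta \ge \alpha \ge 0$, and for all $d_1, \dots, d_N \ge 0$ such that
$\sum_{i=1}^{N} e^{-\alpha d_i} \ge 1$,

$$\ln \frac{\sum_{i=1}^{N} e^{-\alpha d_i}}{\sum_{j=1}^{N} e^{-\beta d_j}} \le \frac{\beta - \alpha}{\alpha} \ln N .$$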



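Proof (of Lemma 5). We begin by writing

$$\ln \frac{\sum_{i=1}^{N} e^{-\alpha d_i}}{\sum_{j=1}^{N} e^{-\beta d_j}}
= \ln \frac{\sum_{i=1}^{N} e^{-\alpha d_i}}{\sum_{j=1}^{N} e^{(\alpha-\beta) d_j}\, e^{-\alpha d_j}}
= -\ln E\bigl[ e^{(\alpha-\beta) D} \bigr]
\le (\beta - \alpha)\, E[D]$$

where we applied Jensen's inequality to the random variable $D$ taking
value $d_i$ with probability $e^{-\alpha d_i} / \sum_{j=1}^{N} e^{-\alpha d_j}$
for each $i = 1, \dots, N$. Since $D$ takes at most $N$ distinct values, its
entropy $H(D)$ is at most $\ln N$. Therefore

$$\ln N \ge H(D)
= \sum_{i=1}^{N} \frac{e^{-\alpha d_i}}{\sum_{j=1}^{N} e^{-\alpha d_j}} \Bigl( \alpha d_i + \ln \sum_{k=1}^{N} e^{-\alpha d_k} \Bigr)
= \alpha\, E[D] + \ln \sum_{j=1}^{N} e^{-\alpha d_j}
\ge \alpha\, E[D]$$

where the last inequality holds since $\sum_{i=1}^{N} e^{-\alpha d_i} \ge 1$.
Hence $E[D] \le (\ln N)/\alpha$. As $\beta \ge \alpha$ by hypothesis, we can
plug the bound on $E[D]$ in the upper bound above and conclude the proof. □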
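A quick numerical sanity check of Lemma 5 can help catch sign or index slips when reading the statement. The sketch below samples random instances; the sampling ranges and helper names are arbitrary choices made for illustration, $\alpha$ is kept strictly positive so that the right-hand side is defined, and one $d_i$ is set to 0 so that the condition $\sum_i e^{-\alpha d_i} \ge 1$ holds.

```python
import math
import random

def lemma5_sides(d, alpha, beta):
    """Return (lhs, rhs) of the inequality in Lemma 5 for the given inputs."""
    num = sum(math.exp(-alpha * x) for x in d)
    den = sum(math.exp(-beta * x) for x in d)
    lhs = math.log(num / den)
    rhs = (beta - alpha) / alpha * math.log(len(d))
    return lhs, rhs

def random_instance(rng, n):
    """Sample nonnegative d_1..d_n and rates with beta >= alpha > 0."""
    d = [rng.uniform(0.0, 5.0) for _ in range(n)]
    d[rng.randrange(n)] = 0.0  # guarantees sum_i exp(-alpha * d_i) >= 1
    alpha = rng.uniform(0.05, 2.0)
    beta = alpha + rng.uniform(0.0, 2.0)
    return d, alpha, beta

rng = random.Random(0)
slacks = []
for _ in range(2000):
    d, alpha, beta = random_instance(rng, rng.randint(2, 8))
    lhs, rhs = lemma5_sides(d, alpha, beta)
    slacks.append(rhs - lhs)

# The lemma predicts rhs - lhs >= 0 on every sampled instance.
worst_slack = min(slacks)
print(worst_slack)
```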

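Proof of Lemma 3. As is usual in the analysis of the exponentially
weighted average predictor, we study the evolution of $\ln(W_{t+1}/W_t)$.
However, here we need to couple this term with
$\ln(w_{j^*_{t-1},t} / w_{j^*_t,t+1})$, including in both terms the
time-varying parameters $\eta_t$, $\eta_{t+1}$. Tracking the currently best
expert $j^*_t$ is used to lower bound the weight
$\ln(w_{j^*_t,t+1}/W_{t+1})$. In fact, the weight of the overall best expert
(after $n$ rounds) could get arbitrarily small during the prediction
process. We thus obtain the following

$$\begin{aligned}
&\frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}}{W_t} - \frac{1}{\eta_{t+1}} \ln \frac{w_{j^*_t,t+1}}{W_{t+1}} \\
&\quad = \Bigl( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Bigr) \ln \frac{W_{t+1}}{w_{j^*_t,t+1}}
 + \frac{1}{\eta_t} \ln \frac{w'_{j^*_t,t+1}/W'_{t+1}}{w_{j^*_t,t+1}/W_{t+1}}
 + \frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}/W_t}{w'_{j^*_t,t+1}/W'_{t+1}} \\
&\quad = (A) + (B) + (C) .
\end{aligned}$$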

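We now bound separately the three terms on the right-hand side. The
term (A) is easily bounded by using $\eta_{t+1} \le \eta_t$ and using the
fact that $j^*_t$ is the index of the expert with largest payoff after the
first $t$ rounds. Therefore, $w_{j^*_t,t+1}/W_{t+1}$ must be at least $1/N$.
Thus we have

$$(A) = \Bigl( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Bigr) \ln \frac{W_{t+1}}{w_{j^*_t,t+1}}
\le \Bigl( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Bigr) \ln N .$$

We proceed to bounding the term (B) as follows

$$\begin{aligned}
(B) = \frac{1}{\eta_t} \ln \frac{w'_{j^*_t,t+1}/W'_{t+1}}{w_{j^*_t,t+1}/W_{t+1}}
&= \frac{1}{\eta_t} \ln \frac{\sum_{i=1}^{N} e^{-\eta_{t+1} (X_{j^*_t,t} - X_{i,t})}}{\sum_{j=1}^{N} e^{-\eta_t (X_{j^*_t,t} - X_{j,t})}} \\
&\le \frac{\eta_t - \eta_{t+1}}{\eta_t\, \eta_{t+1}} \ln N
 = \Bigl( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Bigr) \ln N
\end{aligned}$$

where the inequality is proven by applying Lemma 5 with
$d_i = X_{j^*_t,t} - X_{i,t}$. Note that $d_i \ge 0$ since $j^*_t$ is the
index of the expert with largest payoff after the first $t$ rounds and
$\sum_{i=1}^{N} e^{-\eta_{t+1} d_i} \ge 1$ as for $i = j^*_t$ we have
$d_i = 0$.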

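The term (C) is first split as follows,

$$(C) = \frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}/W_t}{w'_{j^*_t,t+1}/W'_{t+1}}
= \frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}}{w'_{j^*_t,t+1}} + \frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t} .$$

We bound separately each one of the two terms on the right-hand side.
For the first one, we have

$$\frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}}{w'_{j^*_t,t+1}}
= \frac{1}{\eta_t} \ln \frac{e^{\eta_t X_{j^*_{t-1},t-1}}}{e^{\eta_t X_{j^*_t,t}}}
= X_{j^*_{t-1},t-1} - X_{j^*_t,t} .$$

The second term is handled by using the very definition of $\Phi$:

$$\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t}
= \frac{1}{\eta_t} \ln \frac{\sum_{i=1}^{N} w_{i,t}\, e^{\eta_t x_{i,t}}}{W_t}
= \frac{1}{\eta_t} \ln \sum_{i=1}^{N} p_{i,t}\, e^{\eta_t x_{i,t}}
= \sum_{i=1}^{N} p_{i,t}\, x_{i,t} + \Phi(p_t, \eta_t, x_t) .$$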

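Finally, we plug back in the main equation the bounds on the first two
terms (A) and (B), and the bounds on the two parts of the term (C).
After rearranging we obtain

$$\begin{aligned}
0 \le{} & \bigl( X_{j^*_{t-1},t-1} - X_{j^*_t,t} \bigr) + \sum_{i=1}^{N} p_{i,t}\, x_{i,t} + \Phi(p_t, \eta_t, x_t) \\
& + \frac{1}{\eta_{t+1}} \ln \frac{w_{j^*_t,t+1}}{W_{t+1}} - \frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}}{W_t}
 + 2 \Bigl( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Bigr) \ln N .
\end{aligned}$$

We apply the above inequalities to each $t = 1, \dots, n$ and sum up using

$$\sum_{t=1}^{n} \sum_{i=1}^{N} p_{i,t}\, x_{i,t} = \widehat{X}_n , \qquad
\sum_{t=1}^{n} \bigl( X_{j^*_{t-1},t-1} - X_{j^*_t,t} \bigr) = -X^*_n ,$$

$$\sum_{t=1}^{n} \Bigl( \frac{1}{\eta_{t+1}} \ln \frac{w_{j^*_t,t+1}}{W_{t+1}} - \frac{1}{\eta_t} \ln \frac{w_{j^*_{t-1},t}}{W_t} \Bigr)
\le -\frac{1}{\eta_1} \ln \frac{w_{j^*_0,1}}{W_1} = \frac{1}{\eta_1} \ln N$$

to conclude the proof. □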
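Summing the per-round inequality of this proof over $t = 1, \dots, n$ yields $\widehat{X}_n - X^*_n \ge -(2/\eta_{n+1} - 1/\eta_1) \ln N - \sum_t \Phi(p_t, \eta_t, x_t)$. The sketch below runs the exponentially weighted average forecaster with a nonincreasing learning rate and evaluates both sides on random payoffs. It is an illustration under stated assumptions: the schedule $\eta_t = 0.5/\sqrt{t}$, the payoff range $[-1, 1]$, and all function and variable names are arbitrary choices, and $\Phi$ is taken to be $\frac{1}{\eta} \ln \sum_i p_i e^{\eta x_i} - \sum_i p_i x_i$.

```python
import math
import random

def run_ewa(payoffs, etas):
    """Exponentially weighted average forecaster with time-varying rates.

    payoffs[t][i] is the payoff of action i in round t + 1.
    etas[t] is eta_{t+1}; etas must be nonincreasing with len(payoffs) + 1 entries.
    Returns (regret, lower_bound, phi_total).
    """
    n_actions = len(payoffs[0])
    cum = [0.0] * n_actions            # cumulative payoffs X_{i,t-1}
    forecaster_total = 0.0             # expected cumulative payoff of the forecaster
    phi_total = 0.0                    # running sum of Phi(p_t, eta_t, x_t)
    for t, row in enumerate(payoffs):
        eta = etas[t]
        top = max(cum)                 # shift for numerical stability
        w = [math.exp(eta * (c - top)) for c in cum]
        total = sum(w)
        p = [v / total for v in w]
        expected = sum(pi * x for pi, x in zip(p, row))
        log_mgf = math.log(sum(pi * math.exp(eta * x) for pi, x in zip(p, row)))
        phi_total += log_mgf / eta - expected
        forecaster_total += expected
        cum = [c + x for c, x in zip(cum, row)]
    regret = forecaster_total - max(cum)
    eta_first, eta_last = etas[0], etas[len(payoffs)]
    lower_bound = -(2.0 / eta_last - 1.0 / eta_first) * math.log(n_actions) - phi_total
    return regret, lower_bound, phi_total

rng = random.Random(1)
n_rounds, n_actions = 200, 5
payoffs = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)] for _ in range(n_rounds)]
etas = [0.5 / math.sqrt(t) for t in range(1, n_rounds + 2)]
regret, lower_bound, phi_total = run_ewa(payoffs, etas)
print(regret, lower_bound, phi_total)
```

The comparison `regret >= lower_bound` is what the summed inequality predicts; the check says nothing about how tight the bound is on a given sequence.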
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